Generalized Score Matching for Non-Negative Data

A common challenge in estimating parameters of probability density functions is the intractability of the normalizing constant. While in such cases maximum likelihood estimation may be implemented using numerical integration, the approach becomes computationally intensive. The score matching method of Hyvärinen (2005) avoids direct calculation of the normalizing constant and yields closed-form estimates for exponential families of continuous distributions over $\mathbb{R}^m$. Hyvärinen (2007) extended the approach to distributions supported on the non-negative orthant, $\mathbb{R}_+^m$. In this paper, we give a generalized form of score matching for non-negative data that improves estimation efficiency. As an example, we consider a general class of pairwise interaction models. Addressing an overlooked inexistence problem, we generalize the regularized score matching method of Lin et al. (2016) and improve its theoretical guarantees for non-negative Gaussian graphical models.


Introduction
Score matching was first developed in Hyvärinen [2005] for continuous distributions supported on all of $\mathbb{R}^m$. Consider such a distribution $P_0$, with density $p_0$ and support equal to $\mathbb{R}^m$. Let $\mathcal{P}$ be a family of distributions with twice continuously differentiable densities. The score matching estimator of $p_0$ using $\mathcal{P}$ as a model is the minimizer of the expected squared $\ell_2$ distance between the gradients of $\log p_0$ and a log-density from $\mathcal{P}$. So we minimize the loss
$$\int_{\mathbb{R}^m} p_0(\mathbf{x})\, \big\| \nabla \log p(\mathbf{x}) - \nabla \log p_0(\mathbf{x}) \big\|_2^2 \, d\mathbf{x}$$
with respect to densities $p$ from $\mathcal{P}$. The loss depends on $p_0$, but integration by parts can be used to rewrite it in a form that can be approximated by averaging over the sample without knowing $p_0$.
A key feature of score matching is that normalizing constants cancel in gradients of log-densities, allowing for simple treatment of models with intractable normalizing constants. For exponential families, the loss is quadratic in the canonical parameter, making optimization straightforward.
If the considered distributions are supported on a proper subset of $\mathbb{R}^m$, then the integration by parts arguments underlying the score matching estimator may fail due to discontinuities at the boundary of the support. For data supported on the non-negative orthant $\mathbb{R}_+^m$, Hyvärinen [2007] addresses this problem by modifying the loss to
$$\int_{\mathbb{R}_+^m} p_0(\mathbf{x})\, \big\| \nabla \log p(\mathbf{x}) \circ \mathbf{x} - \nabla \log p_0(\mathbf{x}) \circ \mathbf{x} \big\|_2^2 \, d\mathbf{x},$$
where $\circ$ denotes entrywise multiplication. In this loss, boundary effects are dampened by multiplying gradients elementwise with the identity functions $x_j$. In this paper, we propose generalized score matching methods that are based on elementwise multiplication with functions other than $x_j$. As we show, this can lead to drastically improved estimation accuracy, both theoretically and empirically. To demonstrate these advantages, we consider a family of graphical models on $\mathbb{R}_+^m$, which does not have tractable normalizing constants and hence serves as a practical example.
Graphical models specify conditional independence relations for a random vector $\mathbf{X} = (X_i)_{i \in V}$ indexed by the nodes of a graph [Lauritzen, 1996]. For undirected graphs, variables $X_i$ and $X_j$ are required to be conditionally independent given $(X_k)_{k \neq i,j}$ if there is no edge between $i$ and $j$. The smallest undirected graph with this property is the conditional independence graph of $\mathbf{X}$. Estimation of this graph and associated interaction parameters has been a topic of continued research, as reviewed by Drton and Maathuis [2017].
Largely due to their tractability, Gaussian graphical models (GGMs) have gained great popularity. The conditional independence graph of a multivariate normal vector $\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$ is determined by the inverse covariance matrix $\mathbf{K} \equiv \Sigma^{-1}$, also termed the concentration or precision matrix. Specifically, $X_i$ and $X_j$ are conditionally independent given all other variables if and only if the $(i,j)$-th and the $(j,i)$-th entries of $\mathbf{K}$ are both zero. This simple relation underlies a rich literature including Drton and Perlman [2004], Meinshausen and Bühlmann [2006], Yuan and Lin [2007] and Friedman et al. [2008], among others.
More recent work has provided tractable procedures also for non-Gaussian graphical models. This includes Gaussian copula models [Liu et al., 2009, Dobra and Lenkoski, 2011, Liu et al., 2012], Ising models [Ravikumar et al., 2010], other exponential family models [Chen et al., 2015, Yang et al., 2015], as well as semi- or non-parametric estimation techniques [Fellinghauer et al., 2013, Voorman et al., 2014]. In this paper, we apply our method to a class of pairwise interaction models that generalizes non-negative Gaussian random variables, as recently considered by Lin et al. [2016] and Yu et al. [2016], as well as the square root graphical models proposed by Inouye et al. [2016] when the sufficient statistic function is a pure power. However, our main ideas can also be applied to other classes of exponential families whose support is restricted to a rectangular set.
Our focus will be on pairwise interaction power models with probability distributions having (Lebesgue) densities proportional to
$$\exp\left( -\frac{1}{2a}\, (\mathbf{x}^a)^\top \mathbf{K}\, \mathbf{x}^a + \boldsymbol{\eta}^\top \frac{\mathbf{x}^b - \mathbf{1}_m}{b} \right), \qquad \mathbf{x} \in \mathbb{R}_+^m. \tag{1}$$
Here $a > 0$ and $b \geq 0$ are known constants, powers of $\mathbf{x}$ are taken elementwise, and $\mathbf{K} \in \mathbb{R}^{m \times m}$ and $\boldsymbol{\eta} \in \mathbb{R}^m$ are unknown parameters of interest. When $b = 0$ we define $(x^b - 1)/b \equiv \log x$ and $\mathbb{R}_+^m \equiv (0, \infty)^m$. This class of models is motivated by the form of important univariate distributions for non-negative data, including gamma and truncated normal distributions. It provides a framework for pairwise interaction that is concrete yet rich enough to capture key differences in how densities may behave at the boundary of the non-negative orthant, $\mathbb{R}_+^m$. Moreover, the conditional independence graph of a random vector $\mathbf{X}$ with distribution as in (1) is determined just as in the Gaussian case: $X_i$ and $X_j$ are conditionally independent given all other variables if and only if $\kappa_{ij} = \kappa_{ji} = 0$ in the interaction matrix $\mathbf{K}$. Section 5.1 gives further details on these models. We will develop estimators of $(\boldsymbol{\eta}, \mathbf{K})$ in (1) and the associated conditional independence graph using the proposed generalized score matching.
A special case of (1) are truncated Gaussian graphical models, with $a = b = 1$. Let $\boldsymbol{\mu} \in \mathbb{R}^m$, and let $\mathbf{K}$ be a positive definite matrix. Then a non-negative random vector $\mathbf{X}$ follows a truncated normal distribution with mean parameter $\boldsymbol{\mu}$ and inverse covariance parameter $\mathbf{K}$, in symbols $\mathbf{X} \sim \mathrm{TN}(\boldsymbol{\mu}, \mathbf{K})$, if it has density proportional to
$$\exp\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \mathbf{K} (\mathbf{x} - \boldsymbol{\mu}) \right) \tag{2}$$
on $\mathbb{R}_+^m$. We refer to $\Sigma = \mathbf{K}^{-1}$ as the covariance parameter of the distribution, and note that the $\boldsymbol{\eta}$ parameter in (1) is $\mathbf{K}\boldsymbol{\mu}$. Another special case of (1) is the exponential square root graphical model of Inouye et al. [2016], where $a = b = 1/2$. Lin et al. [2016] estimate truncated GGMs based on Hyvärinen's modification, with an $\ell_1$ penalty on the entries of $\mathbf{K}$ added to the loss. However, the paper overlooks the fact that the loss can be unbounded from below in the high-dimensional setting even with an $\ell_1$ penalty, such that no minimizer may exist. Since the unpenalized loss is quadratic in the parameter to be estimated, we propose modifying it by adding small positive values to the diagonal of the positive semi-definite matrix that defines the quadratic part, in order to ensure that the loss is bounded and strongly convex and admits a unique minimizer. We apply this to the estimator for GGMs considered in Lin et al. [2016], which uses score matching on $\mathbb{R}^m$, and to the generalized score matching estimator for pairwise interaction power models on $\mathbb{R}_+^m$ proposed in this paper. In these cases, we show, both empirically and theoretically, that the consistency results still hold (or even improve) if the positive values added are smaller than a threshold that is readily computable.
The rest of the paper is organized as follows. Section 2 introduces score matching and our proposed generalized score matching. In Section 3, we apply generalized score matching to exponential families, with univariate truncated normal distributions as an example. Regularized generalized score matching for graphical models is formulated in Section 4. The estimators applied to the pairwise interaction power models are shown in Section 5, while theoretical consistency results are presented in Section 6, where we treat the probabilistically most tractable case of truncated GGMs. Simulation results are given in Section 7. Proofs for theorems in Sections 2-6 are presented in Appendices A and B. Additional experimental results are presented in Appendix C.

Notation
Constant scalars, vectors, and functions are written in lower-case (e.g., $a$, $\mathbf{a}$), random scalars and vectors in upper-case (e.g., $X$, $\mathbf{X}$). Regular font is used for scalars (e.g., $a$, $X$), and boldface for vectors (e.g., $\mathbf{a}$, $\mathbf{X}$). Matrices are in upright bold, with constant matrices in upper-case ($\mathbf{K}$, $\mathbf{M}$) and random matrices holding observations in lower-case ($\mathbf{x}$, $\mathbf{y}$). Subscripts refer to entries in vectors and columns in matrices. Superscripts refer to rows in matrices. So $X_j$ is the $j$-th component of a random vector $\mathbf{X}$. For a data matrix $\mathbf{x} \in \mathbb{R}^{n \times m}$, each row comprising one observation of $m$ variables/features, $X_j^{(i)}$ is the $j$-th feature for the $i$-th observation. Stacking the columns of a matrix $\mathbf{K} = [\kappa_{ij}]_{i,j} \in \mathbb{R}^{q \times r}$ gives its vectorization $\mathrm{vec}(\mathbf{K}) = (\kappa_{11}, \ldots, \kappa_{q1}, \kappa_{12}, \ldots, \kappa_{q2}, \ldots, \kappa_{1r}, \ldots, \kappa_{qr})^\top$. For a matrix $\mathbf{K} \in \mathbb{R}^{q \times q}$, $\mathrm{diag}(\mathbf{K}) \in \mathbb{R}^q$ denotes its diagonal, and for a vector $\mathbf{v} \in \mathbb{R}^q$, $\mathrm{diag}(\mathbf{v})$ is the $q \times q$ diagonal matrix with diagonal entries $v_1, \ldots, v_q$.
For a function $f: \mathbb{R}^m \to \mathbb{R}$, we define $\partial_j f(\mathbf{x})$ as the partial derivative with respect to $x_j$, and $\partial_{jj} f(\mathbf{x}) = \partial_j \partial_j f(\mathbf{x})$. For $\mathbf{f}: \mathbb{R} \to \mathbb{R}^m$, $\mathbf{f}(x) = (f_1(x), \ldots, f_m(x))^\top$, we let $\mathbf{f}'(x) = (f_1'(x), \ldots, f_m'(x))^\top$ be the vector of derivatives. Likewise $\mathbf{f}''(x)$ is used for second derivatives. The symbol $\mathbb{1}_A(\cdot)$ denotes the indicator function of the set $A$, while $\mathbf{1}_n \in \mathbb{R}^n$ is the vector of all 1's. For $\mathbf{a}, \mathbf{b} \in \mathbb{R}^m$, $\mathbf{a} \circ \mathbf{b} \equiv (a_1 b_1, \ldots, a_m b_m)^\top$. A density of a distribution is always a probability density function with respect to Lebesgue measure. When it is clear from the context, $\mathbb{E}_0$ denotes the expectation under a true distribution $P_0$.
2 Score Matching

2.1 Original Score Matching

Let $\mathbf{X}$ be a random vector taking values in $\mathbb{R}^m$ with distribution $P_0$ and density $p_0$. Let $\mathcal{P}$ be a family of distributions of interest with twice continuously differentiable densities supported on $\mathbb{R}^m$. Suppose $P_0 \in \mathcal{P}$. The score matching loss for $P \in \mathcal{P}$, with density $p$, is given by
$$J(P) = \frac{1}{2} \int_{\mathbb{R}^m} p_0(\mathbf{x})\, \big\| \nabla \log p(\mathbf{x}) - \nabla \log p_0(\mathbf{x}) \big\|_2^2 \, d\mathbf{x}. \tag{3}$$
The gradients in (3) can be thought of as gradients with respect to a hypothetical location parameter, evaluated at the origin [Hyvärinen, 2005]. The loss $J(P)$ is minimized if and only if $P = P_0$, which forms the basis for estimation of $P_0$. Importantly, since the loss depends on $p$ only through its log-gradient, it suffices to know $p$ up to a normalizing constant. Under mild conditions, (3) can be rewritten as
$$\int_{\mathbb{R}^m} p_0(\mathbf{x}) \sum_{j=1}^m \left[ \partial_{jj} \log p(\mathbf{x}) + \frac{1}{2} \big( \partial_j \log p(\mathbf{x}) \big)^2 \right] d\mathbf{x} \tag{4}$$
plus a constant independent of $p$. The integral in (4) can be approximated by a sample average; this alleviates the need for knowing the true density $p_0$, and provides a way to estimate $p_0$.

Generalized Score Matching for Non-Negative Data
When the true density $p_0$ is supported on a proper subset of $\mathbb{R}^m$, the integration by parts underlying the equivalence of (3) and (4) may fail due to discontinuity at the boundary. For distributions supported on the non-negative orthant $\mathbb{R}_+^m$, Hyvärinen [2007] addressed this issue by instead minimizing the non-negative score matching loss
$$J_+(P) = \frac{1}{2} \int_{\mathbb{R}_+^m} p_0(\mathbf{x})\, \big\| \nabla \log p(\mathbf{x}) \circ \mathbf{x} - \nabla \log p_0(\mathbf{x}) \circ \mathbf{x} \big\|_2^2 \, d\mathbf{x}. \tag{5}$$
This loss can be motivated via gradients with respect to a hypothetical scale parameter [Hyvärinen, 2007]. Under mild conditions, $J_+(P)$ can again be rewritten in terms of an expectation of a function independent of $p_0$, thus allowing one to form a sample loss.
In this work, we consider generalizing the non-negative score matching loss as follows.
Definition 1. Let $\mathcal{P}_+$ be the family of distributions of interest, and assume every $P \in \mathcal{P}_+$ has a twice continuously differentiable density supported on $\mathbb{R}_+^m$. Suppose the $m$-variate random vector $\mathbf{X}$ has true distribution $P_0 \in \mathcal{P}_+$, and let $p_0$ be its twice continuously differentiable density. Let $h_1, \ldots, h_m: \mathbb{R}_+ \to \mathbb{R}_+$ be a.s. positive functions that are absolutely continuous in every bounded subinterval of $\mathbb{R}_+$ (and hence differentiable a.s. in $\mathbb{R}_+$), and set $\mathbf{h}(\mathbf{x}) = (h_1(x_1), \ldots, h_m(x_m))^\top$. For $P \in \mathcal{P}_+$ with density $p$, the generalized $\mathbf{h}$-score matching loss is
$$J_{\mathbf{h}}(P) = \frac{1}{2} \int_{\mathbb{R}_+^m} p_0(\mathbf{x})\, \big\| \nabla \log p(\mathbf{x}) \circ \mathbf{h}(\mathbf{x})^{1/2} - \nabla \log p_0(\mathbf{x}) \circ \mathbf{h}(\mathbf{x})^{1/2} \big\|_2^2 \, d\mathbf{x}, \tag{6}$$
where $\mathbf{h}(\mathbf{x})^{1/2} \equiv (h_1(x_1)^{1/2}, \ldots, h_m(x_m)^{1/2})^\top$.

Proposition 1. The distribution $P_0$ is the unique minimizer of $J_{\mathbf{h}}(P)$ for $P \in \mathcal{P}_+$.
Choosing all $h_j(x) = x^2$ recovers the loss from (5). In our generalization, we will focus on using functions $h_j$ that are increasing but are bounded or grow rather slowly. This will alleviate the need to estimate higher moments, leading to better practical performance and improved theoretical guarantees. We note that our approach could also be presented in terms of transformations of data; compare to Section 11 in Parry et al. [2012]. In particular, log-transforming positive data into all of $\mathbb{R}^m$ and then applying (3) is equivalent to (5).
We will consider the following assumptions:

(A1) $p_0(\mathbf{x})\, h_j(x_j)\, \partial_j \log p(\mathbf{x}) \to 0$ as $x_j \searrow 0^+$ and as $x_j \nearrow +\infty$, for all $\mathbf{x}_{-j} \in \mathbb{R}_+^{m-1}$, all $j = 1, \ldots, m$, and all $p \in \mathcal{P}_+$;

(A2) $\mathbb{E}_0 \big\| \nabla \log p(\mathbf{X}) \circ \mathbf{h}(\mathbf{X})^{1/2} \big\|_2^2 < +\infty$ and $\mathbb{E}_0 \big\| \big( \nabla \log p(\mathbf{X}) \circ \mathbf{h}(\mathbf{X}) \big)' \big\|_1 < +\infty$, $\forall p \in \mathcal{P}_+$.

Here, "$\forall p \in \mathcal{P}_+$" is a shorthand for "for all $p$ being the density of some $P \in \mathcal{P}_+$", and the prime symbol denotes component-wise differentiation. While the second half of (A2) was not made explicit in Hyvärinen [2005, 2007], (A1)-(A2) were both required for integration by parts and Fubini-Tonelli to apply. Once the forms of $p_0$ and $p$ are given, sufficient conditions on $\mathbf{h}$ for Assumptions (A1)-(A2) to hold are easy to find. In particular, (A1) and (A2) are easily satisfied and verified for exponential families.
Integration by parts yields the following theorem, which shows that $J_{\mathbf{h}}$ from (6) is an expectation (under $P_0$) of a function that does not depend on $p_0$, similar to (4). The proof is given in Appendix A.

Theorem 1. Under Assumptions (A1)-(A2),
$$J_{\mathbf{h}}(P) = \int_{\mathbb{R}_+^m} p_0(\mathbf{x}) \sum_{j=1}^m \left[ h_j'(x_j)\, \partial_j \log p(\mathbf{x}) + h_j(x_j)\, \partial_{jj} \log p(\mathbf{x}) + \frac{1}{2} h_j(x_j) \big( \partial_j \log p(\mathbf{x}) \big)^2 \right] d\mathbf{x} + C, \tag{7}$$
where $C$ is a constant that does not depend on $p$.
Given a data matrix $\mathbf{x} \in \mathbb{R}^{n \times m}$ with rows $\mathbf{X}^{(i)}$, we define the sample version of (7) as
$$\hat{J}_{\mathbf{h}}(P) = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^m \left[ h_j'(X_j^{(i)})\, \partial_j \log p(\mathbf{X}^{(i)}) + h_j(X_j^{(i)})\, \partial_{jj} \log p(\mathbf{X}^{(i)}) + \frac{1}{2} h_j(X_j^{(i)}) \big( \partial_j \log p(\mathbf{X}^{(i)}) \big)^2 \right]. \tag{8}$$
Subsequently, for a distribution $P$ with density $p$, we let $J_{\mathbf{h}}(p) \equiv J_{\mathbf{h}}(P)$. Similarly, when a distribution $P_\theta$ with density $p_\theta$ is associated to a parameter vector $\boldsymbol{\theta}$, we write $J_{\mathbf{h}}(\boldsymbol{\theta}) \equiv J_{\mathbf{h}}(P_\theta)$. We apply similar conventions to the sample version $\hat{J}_{\mathbf{h}}(P)$. We note that this type of loss is also treated in slightly different settings in Parry [2016] and Almeida and Gidas [1993].

Exponential Families
In this section, we study the case where $\mathcal{P}_+ \equiv \{p_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \Theta\}$ is an exponential family comprising continuous distributions with support $\mathbb{R}_+^m$. More specifically, we consider densities that are indexed by the canonical parameter $\boldsymbol{\theta} \in \mathbb{R}^r$ and have the form
$$\log p_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^\top \mathbf{t}(\mathbf{x}) + b(\mathbf{x}) - \psi(\boldsymbol{\theta}), \qquad \mathbf{x} \in \mathbb{R}_+^m, \tag{9}$$
where $\mathbf{t}(\mathbf{x}) \in \mathbb{R}^r$ comprises the sufficient statistics, $\psi(\boldsymbol{\theta})$ is the normalizing constant depending on $\boldsymbol{\theta}$ only, and $b(\mathbf{x})$ is the base measure, with $\mathbf{t}$ and $b$ a.s. differentiable with respect to each component.
Theorem 2. Under Assumptions (A1)-(A2) from Section 2.2, the empirical generalized $\mathbf{h}$-score matching loss (8) can be rewritten as a quadratic function in $\boldsymbol{\theta} \in \mathbb{R}^r$:
$$\hat{J}_{\mathbf{h}}(\boldsymbol{\theta}) = \frac{1}{2}\, \boldsymbol{\theta}^\top \boldsymbol{\Gamma}(\mathbf{x})\, \boldsymbol{\theta} - \mathbf{g}(\mathbf{x})^\top \boldsymbol{\theta} + \mathrm{const}, \tag{10}$$
where $\boldsymbol{\Gamma}(\mathbf{x}) \in \mathbb{R}^{r \times r}$ and $\mathbf{g}(\mathbf{x}) \in \mathbb{R}^r$ are sample averages of functions of the data matrix $\mathbf{x}$ only.
Theorem 3. Suppose Conditions (C1)-(C2) hold. Then the minimizer of (10) is a.s. unique with closed-form solution $\hat{\boldsymbol{\theta}} \equiv \boldsymbol{\Gamma}(\mathbf{x})^{-1} \mathbf{g}(\mathbf{x})$. Moreover, $\sqrt{n}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$ is asymptotically normal.

Theorems 2 and 3 are proved in Appendix A. Theorem 2 clarifies the quadratic nature of the loss, and Theorem 3 provides a basis for asymptotically valid tests and confidence intervals for the parameter $\boldsymbol{\theta}$. Note that Condition (C1) holds if and only if $h_j(X_j) > 0$ a.s. and $[\partial_j \mathbf{t}(\mathbf{X}^{(1)}), \ldots, \partial_j \mathbf{t}(\mathbf{X}^{(n)})] \in \mathbb{R}^{r \times n}$ has rank $r$ a.s. for some $j = 1, \ldots, m$.
Example 1. Univariate ($m = r = 1$) truncated normal distributions with mean parameter $\mu$ and variance parameter $\sigma^2$ have density
$$p_{\mu,\sigma^2}(x) \propto \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \qquad x > 0. \tag{13}$$
If $\sigma^2$ is known but $\mu$ unknown, then writing the density in canonical form as in (9) yields $\theta = \mu/\sigma^2$ with sufficient statistic $t(x) = x$. Given an i.i.d. sample $X_1, \ldots, X_n \sim p_{\mu_0, \sigma^2}$, the generalized $h$-score matching estimator of $\mu$ is
$$\hat{\mu}_h = \frac{\sum_{i=1}^n h(X_i) X_i - \sigma^2 \sum_{i=1}^n h'(X_i)}{\sum_{i=1}^n h(X_i)}.$$
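To see where this closed form comes from, note that under (13) we have $\partial_x \log p_{\mu,\sigma^2}(x) = (\mu - x)/\sigma^2$ and $\partial_{xx} \log p_{\mu,\sigma^2}(x) = -1/\sigma^2$, so the sample loss (8), as reconstructed above, becomes
$$\frac{1}{n} \sum_{i=1}^n \left[ h'(X_i)\, \frac{\mu - X_i}{\sigma^2} - \frac{h(X_i)}{\sigma^2} + \frac{h(X_i)(\mu - X_i)^2}{2\sigma^4} \right],$$
a quadratic in $\mu$. Setting its derivative in $\mu$ to zero,
$$\frac{1}{n} \sum_{i=1}^n \left[ \frac{h'(X_i)}{\sigma^2} + \frac{h(X_i)(\hat{\mu}_h - X_i)}{\sigma^4} \right] = 0,$$
and solving yields the displayed formula for $\hat{\mu}_h$.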
We recall that the Cramér-Rao lower bound (i.e., the lower bound on the variance of any unbiased estimator) for estimating $\mu$ is $\sigma^4 / \mathrm{var}(X)$, where the variance is computed under $p_{\mu_0, \sigma^2}$.
Example 2. Consider the univariate truncated normal distributions from (13) in the setting where the mean parameter $\mu$ is known but the variance parameter $\sigma^2 > 0$ is unknown. In canonical form as in (9), we write $\theta = 1/\sigma^2$ with sufficient statistic $t(x) = -(x - \mu)^2/2$. Given an i.i.d. sample $X_1, \ldots, X_n \sim p_{\mu, \sigma_0^2}$, the generalized $h$-score matching estimator of $\sigma^2$ is
$$\hat{\sigma}^2_h = \frac{\sum_{i=1}^n h(X_i) (X_i - \mu)^2}{\sum_{i=1}^n \left[ h(X_i) + h'(X_i)(X_i - \mu) \right]}.$$
If, in addition to the assumptions in Example 1, suitable boundary limits involving $h$ and the density vanish, then $\hat{\sigma}^2_h$ is consistent and asymptotically normal.
In these examples, there is a benefit in using a bounded function $h$, which can be explained as follows. When $\mu \gg \sigma$, there is effectively no truncation of the Gaussian distribution, and our method adapts by using low moments in (6), since a bounded and increasing $h(x)$ becomes almost constant as it reaches its asymptote for large $x$. Hence, we effectively revert to the original score matching (recall Section 2.1). In the other cases, the truncation effect is significant and our estimator uses higher moments accordingly. Figure 1 plots the asymptotic variance of $\hat{\mu}_h$ from Example 1, with $\sigma = 1$ known. Efficiency, as measured by the Cramér-Rao lower bound divided by the asymptotic variance, is also shown. We see that two truncated versions of $\log(1 + x)$ have asymptotic variance close to the Cramér-Rao bound. This asymptotic variance is also reflective of the variance for smaller finite samples. Figure 2 is the analog of Figure 1 for $\hat{\sigma}^2_h$ from Example 2 with $\mu = 0.5$ known. While the specifics are a bit different, the benefits of using a bounded or slowly growing $h$ are again clear. We note that when $\sigma$ is small, the effect of truncation to the positive part of the real line is small.
In both plots we order and color the curves based on their overall efficiency, so the same function may have different colors in the two figures, although the same functions are presented. For all functions presented here, (A1)-(A2) and (C1)-(C2) are satisfied.

Regularized Generalized Score Matching
In high-dimensional settings, when the number $r$ of parameters to estimate may be larger than the sample size $n$, it is hard, if not impossible, to estimate the parameters consistently without turning to some form of regularization. More specifically, for exponential families, Condition (C1) in Section 3 fails when $r > n$. A popular approach is then the use of $\ell_1$ regularization to exploit possible sparsity.

Ensuring Bounded $\ell_1$-Regularized Loss
Let the data matrix $\mathbf{x} \in \mathbb{R}^{n \times m}$ comprise $n$ i.i.d. samples from the distribution $P_0$. Assume $P_0$ has density $p_0$ belonging to an exponential family $\mathcal{P}_+ \equiv \{p_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \Theta\}$, where $\Theta \subseteq \mathbb{R}^r$. Adding an $\ell_1$ penalty to (10), we obtain the regularized generalized score matching loss
$$\hat{J}_{\mathbf{h},\lambda}(\boldsymbol{\theta}) = \frac{1}{2}\, \boldsymbol{\theta}^\top \boldsymbol{\Gamma}\, \boldsymbol{\theta} - \mathbf{g}^\top \boldsymbol{\theta} + \lambda \|\boldsymbol{\theta}\|_1, \tag{14}$$
as in Lin et al. [2016]. The loss in (14) involves a quadratic smooth part as in the familiar lasso loss for linear regression. However, although the matrix $\boldsymbol{\Gamma}$ is positive semidefinite, the regularized loss in (14) is not guaranteed to be bounded unless the tuning parameter $\lambda$ is sufficiently large, a problem that does not occur in the lasso. We note that here, and throughout, we suppress the dependence on the data $\mathbf{x}$ for $\boldsymbol{\Gamma}(\mathbf{x})$, $\mathbf{g}(\mathbf{x})$ and derived quantities. For a more detailed explanation, note that by (11), $\boldsymbol{\Gamma} = \mathbf{H}^\top \mathbf{H}$ for some $\mathbf{H} \in \mathbb{R}^{nm \times r}$. In the high-dimensional case, the rank of $\boldsymbol{\Gamma}$, or equivalently of $\mathbf{H}$, is at most $nm < r$. Hence, $\boldsymbol{\Gamma}$ is not invertible and $\mathbf{g}$ does not necessarily lie in the column span of $\boldsymbol{\Gamma}$. Let $\mathrm{Ker}(\boldsymbol{\Gamma})$ be the kernel of $\boldsymbol{\Gamma}$. Then there may exist $\boldsymbol{\nu} \in \mathrm{Ker}(\boldsymbol{\Gamma})$ with $\mathbf{g}^\top \boldsymbol{\nu} \neq 0$. In this case, if $0 \leq \lambda < \sup_{\boldsymbol{\nu} \in \mathrm{Ker}(\boldsymbol{\Gamma})} |\mathbf{g}^\top \boldsymbol{\nu}| / \|\boldsymbol{\nu}\|_1$, there exists $\boldsymbol{\nu} \in \mathrm{Ker}(\boldsymbol{\Gamma})$ with $\frac{1}{2} \boldsymbol{\nu}^\top \boldsymbol{\Gamma} \boldsymbol{\nu} = 0$ and $-\mathbf{g}^\top \boldsymbol{\nu} + \lambda \|\boldsymbol{\nu}\|_1 < 0$. Evaluating at $\boldsymbol{\theta}(a) = a \cdot \boldsymbol{\nu}$ for scalar $a > 0$, the loss becomes $a \left( -\mathbf{g}^\top \boldsymbol{\nu} + \lambda \|\boldsymbol{\nu}\|_1 \right)$, which is negative and linear in $a$, and thus unbounded below. In this case no minimizer of (14) exists for small values of $\lambda$. This issue also exists for the estimators from Zhang and Zou [2014] and Liu and Luo [2015], which correspond to score matching for GGMs. We note that in the context of estimating the interaction matrix in pairwise models, $r = m^2$; thus, the condition $nm < r$ reduces to $n < m$, or $n < m + 1$ when both $\mathbf{K}$ and $\boldsymbol{\eta}$ are estimated.
To circumvent the unboundedness problem, we add small values $\gamma_\ell > 0$ to the diagonal entries of $\boldsymbol{\Gamma}$, which become $\Gamma_{\ell\ell} + \gamma_\ell$, $\ell = 1, \ldots, r$. This is in the spirit of work such as Ledoit and Wolf [2004] and corresponds to an elastic net-type penalty [Zou and Hastie, 2005] with weighted $\ell_2$ penalty $\sum_{\ell=1}^r \gamma_\ell \theta_\ell^2$. After this modification, $\boldsymbol{\Gamma}$ is positive definite, our regularized loss is strongly convex in $\boldsymbol{\theta}$, and a unique minimizer exists for all $\lambda \geq 0$. For the special case of truncated GGMs, we will show that a result on consistent estimation holds if we choose $\gamma_\ell = \delta_0 \Gamma_{\ell\ell}$ for a suitably small constant $\delta_0 > 0$, for which we propose a particular choice to avoid tuning. This choice of $\gamma_\ell$ depends on the data through $\Gamma_{\ell\ell}$.
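To make the unboundedness and its fix concrete, the following minimal numpy sketch builds a rank-deficient $\boldsymbol{\Gamma} = \mathbf{H}^\top \mathbf{H}$, evaluates the loss (14) along a kernel direction, and then adds a uniform diagonal amplification (a single constant $\gamma$, a simplification of the entry-wise $\gamma_\ell$ above; all dimensions are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-deficient quadratic part: Gamma = H'H with nm < r, mimicking the
# high-dimensional regime discussed above (here 5 "rows" vs r = 10 parameters).
H = rng.standard_normal((5, 10))
Gamma = H.T @ H
g = rng.standard_normal(10)

def loss(theta, lam, gamma=0.0):
    """1/2 theta'(Gamma + gamma I)theta - g'theta + lam * ||theta||_1."""
    r = len(theta)
    return (0.5 * theta @ (Gamma + gamma * np.eye(r)) @ theta
            - g @ theta + lam * np.abs(theta).sum())

# A kernel direction of Gamma with g'nu != 0 (generic for random data).
nu = np.linalg.svd(H)[2][-1]
assert np.allclose(Gamma @ nu, 0.0)

lam = 0.5 * abs(g @ nu) / np.abs(nu).sum()    # lambda below the critical value
direction = np.sign(g @ nu) * nu
for a in [1.0, 100.0, 10000.0]:
    print(loss(a * direction, lam))           # decreases linearly: unbounded below

# Diagonal amplification restores strong convexity, so the loss is bounded:
for a in [1.0, 100.0, 10000.0]:
    print(loss(a * direction, lam, gamma=0.01))  # quadratic term eventually dominates
```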

Computational Details
We optimize our loss functions with respect to a symmetric matrix $\hat{\mathbf{K}}$ and, in the non-centered case, also the vector $\hat{\boldsymbol{\eta}}$. We use a coordinate-descent method analogous to Algorithm 2 in Lin et al. [2016], where in each step we update each element of $\hat{\mathbf{K}}$ and $\hat{\boldsymbol{\eta}}$ based on the other entries from the previous steps, while maintaining symmetry. To speed up estimation, we use warm starts with the solution from the previous $\lambda$, as well as lasso-type strong screening rules [Tibshirani et al., 2012] to eliminate variables that are known a priori to have zero estimates. In our simulations in Section 7 we always scale the data matrix by column $\ell_2$ norms before proceeding to estimation.

5 Score Matching for Graphical Models for Non-negative Data

A General Framework of Pairwise Interaction Models
We consider the class of pairwise interaction power models with density introduced in (1). We recall the form of the density:
$$p(\mathbf{x}) \propto \exp\left( -\frac{1}{2a}\, (\mathbf{x}^a)^\top \mathbf{K}\, \mathbf{x}^a + \boldsymbol{\eta}^\top \frac{\mathbf{x}^b - \mathbf{1}_m}{b} \right), \qquad \mathbf{x} \in \mathbb{R}_+^m, \tag{16}$$
where $a$ and $b$ are known constants, and the interaction matrix $\mathbf{K}$ and the vector $\boldsymbol{\eta}$ are parameters. When $b = 0$, we use the convention $(\mathbf{x}^0 - \mathbf{1}_m)/0 \equiv \log \mathbf{x}$ and apply the logarithm element-wise. Our focus will be on the interaction matrix $\mathbf{K}$, which determines the conditional independence graph through its support $S(\mathbf{K}) \equiv \{(i, j) : \kappa_{ij} \neq 0\}$. However, unless $\boldsymbol{\eta}$ is known or assumed to be zero, we also need to estimate $\boldsymbol{\eta}$ as a nuisance parameter.
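For quick reference, the following Python sketch evaluates the unnormalized log-density in (16) as reconstructed above; the function name and interface are ours, not the paper's.

```python
import numpy as np

def log_density_unnormalized(x, K, eta, a, b):
    """Unnormalized log-density of the pairwise interaction power model (16):
    -(1/(2a)) (x^a)' K (x^a) + eta' (x^b - 1)/b, with (x^b - 1)/b = log(x)
    when b == 0 (all powers taken elementwise)."""
    x = np.asarray(x, dtype=float)
    xa = x ** a
    t = np.log(x) if b == 0 else (x ** b - 1.0) / b
    return -(xa @ K @ xa) / (2.0 * a) + eta @ t

# Conditional independence is read off the support of K: here kappa_13 = 0,
# so X_1 and X_3 are conditionally independent given X_2.
K = np.array([[2.0, 0.5, 0.0],
              [0.5, 2.0, 0.0],
              [0.0, 0.0, 2.0]])
eta = np.zeros(3)
print(log_density_unnormalized([1.0, 2.0, 0.5], K, eta, a=1.0, b=1.0))
```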
We first give a set of sufficient conditions for the density to be valid, i.e., for the right-hand side of (16) to be integrable. The proof is given in the appendix. The conditions are: (CC1) $\mathbf{v}^\top \mathbf{K} \mathbf{v} > 0$ for all $\mathbf{v} \in \mathbb{R}_+^m \setminus \{\mathbf{0}\}$, i.e., $\mathbf{K}$ is strictly co-positive; (CC2) $2a > b > 0$; (CC3) $b = 0$ and $\boldsymbol{\eta} > -\mathbf{1}_m$ entry-wise.

Theorem 4. In the non-centered case, if (CC1) and one of (CC2) and (CC3) hold, then the function on the right-hand side of (16) is integrable over $\mathbb{R}_+^m$. In the centered case, (CC1) and $a > 0$ are sufficient.
We emphasize that (CC1) is a weaker condition than positive definiteness. Criteria for strict co-positivity are discussed in Väliaho [1986].

Implementation for Different Models
In this section we give some implementation details for the regularized generalized $\mathbf{h}$-score matching estimator defined in (15), applied to the pairwise interaction models from (16). We again let $\boldsymbol{\Psi} \equiv [\mathbf{K}^\top, \boldsymbol{\eta}]^\top \in \mathbb{R}^{(m+1) \times m}$. The unregularized loss is then
$$\hat{J}_{\mathbf{h}}(\boldsymbol{\Psi}) = \frac{1}{2}\, \mathrm{vec}(\boldsymbol{\Psi})^\top \boldsymbol{\Gamma}\, \mathrm{vec}(\boldsymbol{\Psi}) - \mathbf{g}^\top \mathrm{vec}(\boldsymbol{\Psi}).$$
The general form of the matrix $\boldsymbol{\Gamma}$ and the vector $\mathbf{g}$ in the loss were given in equations (10)-(12).
The matrix $\boldsymbol{\Gamma}$ is block-diagonal, with one block $\boldsymbol{\Gamma}_j$ per column of $\boldsymbol{\Psi}$; in forming these blocks, the $\circ$ product between a vector and a matrix means an elementwise multiplication of the vector with each column of the matrix. The vector $\mathbf{g}$ splits into parts $\mathbf{g}_1$ and $\mathbf{g}_2$ corresponding to the entries of $\mathbf{K}$ and $\boldsymbol{\eta}$, respectively. The $j$-th column of $\mathbf{g}_1 \in \mathbb{R}^{m \times m}$ involves $\mathbf{e}_{j,m}$, the $m$-vector with 1 at the $j$-th position and 0 elsewhere. In the centered case where we know $\boldsymbol{\eta}_0 \equiv \mathbf{0}$, we only estimate $\mathbf{K} \in \mathbb{R}^{m \times m}$, and $\boldsymbol{\Gamma} \in \mathbb{R}^{m^2 \times m^2}$ is still block-diagonal, with the $j$-th block being the $\boldsymbol{\Gamma}_{11,j}$ submatrix in (17), while $\mathbf{g}$ is just $\mathrm{vec}(\mathbf{g}_1)$.
Since $b$ only appears in the $\boldsymbol{\eta}$ part of the density, the formulae only depend on $a$ in the centered case. These formulae also hold for $b = 0$, since $\boldsymbol{\Gamma}$ and $\mathbf{g}$ only depend on the gradient of the log-density. We emphasize that it is indeed necessary to introduce amplifiers $\gamma_\ell > 0$, or a multiplier $\delta > 1$, in addition to the $\ell_1$ penalty. It is clear from (18) that $\mathrm{rank}(\boldsymbol{\Gamma}_j) \leq \min\{n, m + 1\}$ (or $\min\{n, m\}$ if centered). Thus, $\boldsymbol{\Gamma}$ is non-invertible when $n \leq m$ (or $n < m$ if centered) and $\mathbf{g}$ need not lie in its column span.
As $\boldsymbol{\Gamma}_j$ itself is positive semidefinite, we find that $\boldsymbol{\nu}^\top \boldsymbol{\Gamma}_{j,\gamma} \boldsymbol{\nu} > 0$ a.s. both when one of the first $m$ entries of $\boldsymbol{\nu}$ is nonzero and when only the last entry of $\boldsymbol{\nu}$ is nonzero. We conclude that $\boldsymbol{\Gamma}_{j,\gamma}$ (and thus the entire amplified $\boldsymbol{\Gamma}$) is a.s. positive definite, which ensures unique existence of the loss minimizer.
Given the formulae for $\boldsymbol{\Gamma}$ and $\mathbf{g}$, one adds the $\ell_1$ penalty on $\boldsymbol{\Psi}$ to get the regularized loss (24). Our methodology readily accommodates two different choices of the penalty parameter $\lambda$ for $\mathbf{K}$ and $\boldsymbol{\eta}$. This is also theoretically supported for truncated GGMs, since if the ratio of the respective values $\lambda_{\mathbf{K}}$ and $\lambda_{\boldsymbol{\eta}}$ is fixed, the proofs of the theorems in Section 6 can be easily modified by replacing $\boldsymbol{\eta}$ by $(\lambda_{\boldsymbol{\eta}}/\lambda_{\mathbf{K}}) \boldsymbol{\eta}$. To avoid picking two tuning parameters, one may also choose to remove the penalty on $\boldsymbol{\eta}$ altogether by profiling out $\boldsymbol{\eta}$: one solves for $\hat{\boldsymbol{\eta}} \equiv \boldsymbol{\Gamma}_{22}^{-1} \big( \mathbf{g}_2 - \boldsymbol{\Gamma}_{12}^\top \mathrm{vec}(\hat{\mathbf{K}}) \big)$, with $\hat{\mathbf{K}}$ the minimizer of the profiled loss (19), in which the Schur complement $\boldsymbol{\Gamma}_{\gamma,11.2} \equiv \boldsymbol{\Gamma}_{\gamma,11} - \boldsymbol{\Gamma}_{12} \boldsymbol{\Gamma}_{22}^{-1} \boldsymbol{\Gamma}_{12}^\top$ replaces the quadratic part; this matrix is a.s. positive definite, so the profiled estimator exists a.s. for all $\lambda \geq 0$. This profiled approach corresponds to choosing $\lambda_{\boldsymbol{\eta}}/\lambda_{\mathbf{K}} = 0$. A detailed theoretical analysis of the profiled estimator is beyond the scope of this paper, however. We note that in the other extreme, with $\lambda_{\boldsymbol{\eta}}/\lambda_{\mathbf{K}} = +\infty$, the non-centered estimator reduces to the estimator from the centered case.
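The profiling step is plain block linear algebra. The following numpy sketch (helper names and per-column block shapes are our own assumptions) shows both the Schur complement and the back-substitution for $\hat{\boldsymbol{\eta}}$.

```python
import numpy as np

def profile_out_eta(Gamma11, Gamma12, Gamma22, g1, g2, gamma=0.01):
    """Profiled quadratic for one column j: amplify the diagonal of the K-block,
    then form the Schur complement Gamma_{gamma,11.2} and the matching linear
    term. Shapes: Gamma11 (m, m), Gamma12 (m, 1), Gamma22 (1, 1), g1 (m,), g2 (1,)."""
    Gamma11 = Gamma11 + gamma * np.eye(Gamma11.shape[0])
    G22_inv = np.linalg.inv(Gamma22)
    schur = Gamma11 - Gamma12 @ G22_inv @ Gamma12.T       # Gamma_{gamma,11.2}
    g_profiled = g1 - (Gamma12 @ G22_inv @ g2).ravel()
    return schur, g_profiled

def eta_hat(Gamma12, Gamma22, g2, k_col):
    """Back-substitution: eta_j = Gamma22^{-1} (g2 - Gamma12' k_col)."""
    return np.linalg.solve(Gamma22, g2 - Gamma12.T @ k_col)
```

Minimizing the $\ell_1$-penalized quadratic built from `schur` and `g_profiled` over the column of $\mathbf{K}$, followed by the back-substitution, corresponds to the profiled approach with $\lambda_{\boldsymbol{\eta}}/\lambda_{\mathbf{K}} = 0$.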
Example 3. The truncated normal model comprises the densities
$$p(\mathbf{x}) \propto \exp\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \mathbf{K} (\mathbf{x} - \boldsymbol{\mu}) \right), \qquad \mathbf{x} \in \mathbb{R}_+^m,$$
which corresponds to (16) with $a = b = 1$ and $\boldsymbol{\eta} = \mathbf{K}\boldsymbol{\mu}$.

Example 4. The exponential square root graphical model in Inouye et al. [2016] has densities proportional to
$$\exp\left( -(\mathbf{x}^{1/2})^\top \mathbf{K}\, \mathbf{x}^{1/2} + 2\boldsymbol{\eta}^\top (\mathbf{x}^{1/2} - \mathbf{1}_m) \right),$$
which corresponds to (16) with $a = b = 1/2$. We refer to this as the exponential model.

Example 5. If $a = 1/2$ and $b = 0$, then (16) becomes
$$p(\mathbf{x}) \propto \exp\left( -(\mathbf{x}^{1/2})^\top \mathbf{K}\, \mathbf{x}^{1/2} + \boldsymbol{\eta}^\top \log \mathbf{x} \right). \tag{23}$$
If $\mathbf{K}$ is diagonal in this case, then $\mathbf{X} \sim p_{\boldsymbol{\eta},\mathbf{K}}$ has independent entries with $X_j$ following the gamma distribution with rate $\kappa_{jj}$ and shape $\eta_j + 1$, which gives an intuition for condition (CC3) $\eta_j > -1$ in Theorem 4. We can thus view (23) as a multivariate gamma distribution with pairwise interactions among the covariates, and call this the gamma model. For this model, the $\boldsymbol{\Gamma}_{11,j}$ sub-matrix of $\boldsymbol{\Gamma}_j$ and the $\mathbf{g}_{1,j}$ sub-vector of $\mathbf{g}_j$ are the same as those for the exponential model, since $a = 1/2$ in both cases and the parts of the densities involving $\mathbf{K}$ are the same.
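A quick numerical check of the gamma-distribution claim in Example 5, using scipy; the check verifies that the unnormalized one-coordinate log-density differs from the gamma log-pdf only by a constant (parameter values are illustrative).

```python
import numpy as np
from scipy import stats

# With diagonal K, one coordinate of the gamma model (a = 1/2, b = 0) has
# unnormalized log-density -kappa*x + eta_j*log(x), i.e. Gamma(eta_j + 1, rate kappa).
kappa, eta_j = 2.0, 0.5
x = np.linspace(0.1, 5.0, 50)
log_unnorm = -kappa * x + eta_j * np.log(x)
log_gamma = stats.gamma.logpdf(x, a=eta_j + 1.0, scale=1.0 / kappa)
# The difference is constant in x (it equals the log normalizing constant):
print(np.ptp(log_unnorm - log_gamma))   # ~0 up to floating point error
```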

Requirements on h
In Section 2.2, we presented two assumptions, (A1) and (A2), under which the generalized score matching loss is valid, i.e., the integration by parts is justified and Theorem 1 holds. In this section, we present some sufficient (and nearly necessary) requirements on $\mathbf{h}$ such that (A1) and (A2) are satisfied.
Definition 3. We write $\mathbf{h} \in \mathcal{H}_{a,b}$ (for simplicity we omit the dependency on $m$) if for all $j = 1, \ldots, m$: i) $h_j$ is absolutely continuous in every bounded sub-interval of $\mathbb{R}_+$, and thus has derivative $h_j'$ a.s.; ii) $h_j > 0$ a.s. on $\mathbb{R}_+$; iii) $h_j$ and $h_j'$ are both bounded by some piecewise powers of $x$ a.s. on $\mathbb{R}_+$; iv) $h_j(x_j)$ decays sufficiently fast as $x_j \searrow 0^+$, in the sense that $h_j(x_j)/x_j^q$ remains bounded with $q = \max\{1 - a, 1 - b\}$ for $b > 0$ (see below for $b = 0$ and for the centered case).

Theorem 5. Assume every $P$ in the family of distributions $\mathcal{P}_+$ satisfies (CC1)-(CC3) and thus has finite normalizing constants. If $\mathbf{h} \in \mathcal{H}_{a,b}$, then (A1) and (A2) are satisfied.
In centered models, where $\boldsymbol{\eta} \equiv \mathbf{0}$, we can assume $b = 2a$, and iv) in the definition of $\mathcal{H}_{a,2a}$ has $q = 1 - a$. For truncated GGMs, $a = b = 1$, so iv) in Definition 3 is simply $\lim_{x_j \searrow 0^+} h_j(x_j) = 0$. In the case of $b = 0$, $\boldsymbol{\eta}$ is an unknown parameter, and (CC3) requires each of its components to be greater than $-1$. If one has prior information on $\boldsymbol{\eta}$ or restricts the parameter space for $\boldsymbol{\eta}$, the requirement reduces to $h_j(x_j) = O\big(x_j^{1 - \min_j \eta_{0,j}}\big)$ as $x_j \searrow 0^+$. Note that this is only a condition for $x_j \searrow 0^+$; the globally quadratic behavior of $h_j(x_j) = x_j^2$ from the original score matching is not needed on the entire $\mathbb{R}_+$, leaving opportunities for improvements.

Reasonable Choices of h
Assume a common univariate $h$ for all components of $\mathbf{h}$. Inspired by Theorem 5, we consider $h$ that behaves like a power of $x$ both as $x \to +\infty$ and as $x \searrow 0^+$. Since the requirements on the two tails are separate, we can choose $h$ to be a piecewise defined function that joins two powers with possibly different degrees. In other words, $h(x) = \min(x^{p_1}, c x^{p_2})$ for some powers $p_1 \geq p_2 \geq 0$ and constant $c > 0$. Only one constant $c$ is required since generalized score matching is invariant to scaling of $h$. In determining the exact power $p_1$ we have the following considerations:

a) In the centered case:
(i) "(A1) and (A2)": Theorem 5 requires that $p_1 \geq 1 - a$.
(ii) "Controlled $\boldsymbol{\Gamma}$ and $\mathbf{g}$ for $\mathbf{x}^a$": We propose avoiding poles at the origin for the entries of $\boldsymbol{\Gamma}$ and $\mathbf{g}$. The formula for $\boldsymbol{\Gamma}_{11}$ in (18) shows that, to this end, $h(x) x^{a-1}$ needs to have a non-negative degree, and the corresponding terms in $\mathbf{g}$ require $p_1 \geq 2 - a$.

b) In the non-centered case, in addition to (i) and (ii):
(iii) "(A1) and (A2)": Theorem 5 requires $p_1 \geq \max\{1 - a, 1 - b\}$ for $b > 0$, or $p_1 \geq 1 - \min_j \eta_{0,j}$ for $b = 0$.
(iv) "Controlled $\boldsymbol{\Gamma}$ and $\mathbf{g}$ for $\mathbf{x}^b$": From the definition of $\boldsymbol{\Gamma}_{22}$ and $\mathbf{g}_2$, and by the same reasoning as above, we require $p_1 \geq 2 - b$ for $b > 0$.

The choice of $p_2$ is only relevant for large data points. Our main consideration is then merely how well $\boldsymbol{\Gamma}$ and $\mathbf{g}$ concentrate around their population values (Theorem 6). From this perspective, our intuition is that $p_2$ should be chosen small so that the tails of the distributions of the entries of $\boldsymbol{\Gamma}$ and $\mathbf{g}$ are well-behaved. Thus, we can choose $p_2 = 0$, in which case $h(x) = \min(x^{p_1}, c)$ is a truncated power.

Tuning Parameter Selection
By treating the unpenalized loss (i.e., $\lambda = 0$, $\gamma = 0$) as a negative log-likelihood, we may use the extended Bayesian Information Criterion (eBIC) to choose the tuning parameter [Chen and Chen, 2008, Foygel and Drton, 2010]. Consider the centered case as an example. Let $\hat{S}_\lambda \equiv \{(i, j) : \hat{\kappa}_{ij}^\lambda \neq 0,\ i < j\}$, where $\hat{\mathbf{K}}^\lambda$ is the estimate associated with tuning parameter $\lambda$. The eBIC is then
$$\mathrm{eBIC}(\lambda) = 2n\, \hat{J}_{\mathbf{h}}(\hat{\mathbf{K}}) + |\hat{S}_\lambda| \log n + 2 \log \binom{m(m-1)/2}{|\hat{S}_\lambda|},$$
where $\hat{\mathbf{K}}$ can be either the original estimate associated with $\lambda$, or a refitted solution obtained by restricting the support to $\hat{S}_\lambda$. We use the eBIC instead of the ordinary BIC (Bayesian Information Criterion) since the BIC tends to choose an overly complex model when the model space is large, as encountered in the high-dimensional setting. The extension in the eBIC comes from the last term in the above display, which can be motivated by a prior distribution under which the number of edges in the conditional independence graph is uniformly distributed; see also Żak-Szatkowska and Bogdan [2011] and Barber and Drton [2015].
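A sketch of how such a criterion could be computed; since the eBIC display above is itself a reconstruction, treat the exact constants here as our assumption rather than the paper's definition.

```python
import numpy as np
from scipy.special import gammaln

def ebic(loss_value, support_size, n, m):
    """eBIC for the centered estimator, treating 2n * J_hat as a deviance.

    loss_value: unpenalized empirical loss at the (possibly refitted) K_hat;
    support_size: number of estimated edges |S_lambda| with i < j.
    The model-space term 2*log C(m(m-1)/2, |S|) encodes the uniform prior on
    the number of edges (hypothetical instantiation of the display above).
    """
    n_possible = m * (m - 1) // 2
    log_binom = (gammaln(n_possible + 1) - gammaln(support_size + 1)
                 - gammaln(n_possible - support_size + 1))
    return 2.0 * n * loss_value + support_size * np.log(n) + 2.0 * log_binom
```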

Theory for Graphical Models
Two innovations stand out in our regularized generalized score matching framework: we introduced the amplifiers/multipliers to address the inexistence problem, and we proposed using a general function $\mathbf{h}$ in place of $x^2$. This section provides a theoretical analysis of these two aspects.
In Section 6.1, we present the theory for our regularized generalized score matching estimators for general pairwise interaction models before going into the details for the special cases of (truncated) GGMs. Next, we show that a specific choice of amplifiers/multipliers yields consistent estimation without the need for tuning. This point is important even in the case of Gaussian models on all of $\mathbb{R}^m$. Therefore, in Section 6.2 we digress from non-negative data and consider the original score matching of Hyvärinen [2005] for centered Gaussian distributions. Finally, in Section 6.3, we derive probabilistic results for $\hat{\boldsymbol{\Psi}}$ based on Theorem 6, justifying the benefits of using a general bounded $\mathbf{h}$ over $x^2$ in the non-negative setting. As the most important models from the class of pairwise interaction power models over $\mathbb{R}_+^m$, we treat only truncated GGMs, since they have the most tractable concentration bounds; this case also provides a comparison to Corollary 2 in Lin et al. [2016], which uses $x^2$.

Theory for Pairwise Interaction Models
The graphical models we treat are parametrized by the interaction matrix $\mathbf{K}$ and the coefficients $\boldsymbol{\eta}$ on $(\mathbf{x}^b - \mathbf{1}_m)/b$. It is convenient to accommodate this setting with a matrix-valued parameter $\boldsymbol{\Psi} \in \mathbb{R}^{r_1 \times r_2}$ (in place of $\boldsymbol{\theta}$) and specify our regularized $\mathbf{h}$-score matching loss as
$$\hat{J}_{\mathbf{h},\lambda}(\boldsymbol{\Psi}) = \frac{1}{2}\, \mathrm{vec}(\boldsymbol{\Psi})^\top \boldsymbol{\Gamma}_\gamma\, \mathrm{vec}(\boldsymbol{\Psi}) - \mathbf{g}^\top \mathrm{vec}(\boldsymbol{\Psi}) + \lambda \|\mathrm{vec}(\boldsymbol{\Psi})\|_1, \tag{24}$$
where $\boldsymbol{\Gamma}_\gamma$ denotes the amplified version of $\boldsymbol{\Gamma}$. Consider the pairwise interaction power models in (1). In the case where we assume the linear part $(\mathbf{x}^b - \mathbf{1}_m)/b$ is not present, or in other words $\boldsymbol{\eta}_0 \equiv \mathbf{0}$ is known, we call the distribution (and the corresponding estimator) a centered distribution (estimator), in contrast to the general case, termed non-centered, in which $\boldsymbol{\eta}$ is treated as unknown.
In the non-centered case we thus take $\boldsymbol{\Psi} = [\mathbf{K}^\top, \boldsymbol{\eta}]^\top \in \mathbb{R}^{(m+1) \times m}$. In the centered case, $\boldsymbol{\Psi}$ is simply the $m \times m$ interaction matrix $\mathbf{K}$. Following related prior work such as Lin et al. [2016], for ease of proof we allow the matrix $\mathbf{K}$ to be nonsymmetric, which allows us to decouple optimization over the different columns of $\mathbf{K}$ or $\boldsymbol{\Psi}$, while in our implementations we ensure that $\mathbf{K}$ is symmetric.
Definition 4. Let $\boldsymbol{\Gamma}_0 \equiv \mathbb{E}_0 \boldsymbol{\Gamma}(\mathbf{x})$ and $\mathbf{g}_0 \equiv \mathbb{E}_0 \mathbf{g}(\mathbf{x})$ be the population versions of $\boldsymbol{\Gamma}(\mathbf{x})$ and $\mathbf{g}(\mathbf{x})$ under the distribution given by a true parameter matrix $\boldsymbol{\Psi}_0$. The support of a matrix $\boldsymbol{\Psi}$ is $S(\boldsymbol{\Psi}) \equiv \{(i, j) : \psi_{ij} \neq 0\}$, and we let $S_0 = S(\boldsymbol{\Psi}_0)$. For a matrix $\boldsymbol{\Psi}_0$, we define $d_{\boldsymbol{\Psi}_0}$ to be the maximum number of non-zero entries in any column, and we set
$$c_{\boldsymbol{\Gamma}_0} \equiv \big\| \big( \boldsymbol{\Gamma}_{0, S_0 S_0} \big)^{-1} \big\|_{\infty}. \tag{25}$$
Finally, $\boldsymbol{\Gamma}_0$ satisfies the irrepresentability condition with incoherence parameter $\alpha \in (0, 1]$ and edge set $S_0$ if
$$\max_{e \in S_0^c} \big\| \boldsymbol{\Gamma}_{0, e S_0} \big( \boldsymbol{\Gamma}_{0, S_0 S_0} \big)^{-1} \big\|_1 \leq 1 - \alpha. \tag{26}$$
Our analysis of the regularized generalized $\mathbf{h}$-score matching estimator builds on the following theorem taken from Lin et al. [2016, Theorem 1].
Theorem 6. Suppose $\boldsymbol{\Gamma}_0$ has $\boldsymbol{\Gamma}_{0, S_0 S_0}$ invertible and satisfies the irrepresentability condition (26) with incoherence parameter $\alpha \in (0, 1]$. Assume that the sampling errors $\|\boldsymbol{\Gamma}_\gamma - \boldsymbol{\Gamma}_0\|_\infty$ and $\|\mathbf{g} - \mathbf{g}_0\|_\infty$ are sufficiently small relative to $\alpha$, $c_{\boldsymbol{\Gamma}_0}$, $d_{\boldsymbol{\Psi}_0}$, and the regularization parameter $\lambda$. Then the following holds: (a) The regularized generalized $\mathbf{h}$-score matching estimator $\hat{\boldsymbol{\Psi}}$ minimizing (24) is unique, with support $\hat{S} \equiv S(\hat{\boldsymbol{\Psi}}) \subseteq S_0$, and satisfies an entry-wise error bound proportional to $c_{\boldsymbol{\Gamma}_0} \lambda$. This result is deterministic, and the improvement of our generalized estimator over the one in Lin et al. [2016] is in its probabilistic guarantees, as shown for truncated GGMs in Theorems 8 and 9 in Section 6.3. Before going into these examples, we state a general corollary.

Revisiting Gaussian Score Matching
In this section we consider estimating the inverse covariance matrix $\mathbf{K}$ of a centered Gaussian distribution $N(\mathbf{0}, \mathbf{K})$, parametrized (as for TN) by its inverse covariance, which of course has density proportional to (2), with $\boldsymbol{\mu} = \mathbf{0}$, on all of $\mathbb{R}^m$. As shown, e.g., in Example 1 of Lin et al. [2016], the $\ell_1$-regularized score matching loss then takes the form
$$\frac{1}{2}\, \mathrm{vec}(\mathbf{K})^\top \boldsymbol{\Gamma}\, \mathrm{vec}(\mathbf{K}) - \mathrm{tr}(\mathbf{K}) + \lambda \|\mathrm{vec}(\mathbf{K})\|_1, \tag{28}$$
which can be written as (14) with $\boldsymbol{\theta} = \mathrm{vec}(\mathbf{K})$, $\boldsymbol{\Gamma} = \mathrm{diag}(\mathbf{x}^\top \mathbf{x}/n, \ldots, \mathbf{x}^\top \mathbf{x}/n)$ and $\mathbf{g} = \mathrm{vec}(\mathbf{I}_m)$. Thus, in general, the kernel of $\boldsymbol{\Gamma}$ need not be orthogonal to $\mathbf{g}$, and for $\lambda$ small the loss can be unbounded below, as discussed above. Hence, an amplifier/multiplier on the diagonal of $\boldsymbol{\Gamma}$ is needed. We have the following theorem on the estimator using the amplification.
Theorem 7. Suppose the data matrix $\mathbf{x}$ holds $n$ i.i.d. copies of $\mathbf{X} \sim N(\mathbf{0}, \mathbf{K}_0)$. Adopt the amplification from Section 4.1, redefining the loss in (28) by multiplying the diagonal entries of $\boldsymbol{\Gamma}$ by $\delta$, where $1 < \delta < 2 - (1 + 80 \log m / n)^{-1}$. Let $\hat{\mathbf{K}}$ be the resulting estimator. Let $c_* \equiv 12800 \left( \max_j \Sigma_{0,jj} \right)^2$ and $c_1 \equiv 4 c_{\boldsymbol{\Gamma}_0}/\alpha$. If, for some $\tau > 2$, the regularization parameter and the sample size are sufficiently large relative to $c_*$, $c_1$, $\tau$, and $\log m$, then with high probability the conclusions of Theorem 6 hold for $\hat{\mathbf{K}}$. In Corollary 1 of Lin et al. [2016] the same results were shown with $c_* \equiv 3200 \left( \max_j \Sigma_{0,jj} \right)^2$ when a unique minimizer exists, but the existence was not guaranteed.
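For intuition, here is a minimal sketch of this amplified Gaussian score matching estimator, using a plain ISTA (proximal gradient) solver with all entries penalized; this is our own illustrative solver, not the paper's coordinate-descent algorithm, and the values of `lam` and `delta` are arbitrary.

```python
import numpy as np

def gaussian_score_matching(x, lam, delta=1.2, n_iter=500):
    """Minimize 0.5 tr(K' S_delta K) - tr(K) + lam * ||K||_1, where S = x'x/n
    and S_delta adds (delta - 1) diag(S) to S, implementing the diagonal
    amplification of Gamma = diag(S, ..., S) discussed above."""
    n, m = x.shape
    S = x.T @ x / n
    S_delta = S + (delta - 1.0) * np.diag(np.diag(S))
    step = 1.0 / np.linalg.eigvalsh(S_delta).max()
    K = np.eye(m)
    for _ in range(n_iter):
        grad = S_delta @ K - np.eye(m)                   # gradient of smooth part
        K = K - step * grad
        K = np.sign(K) * np.maximum(np.abs(K) - step * lam, 0.0)  # soft threshold
    return (K + K.T) / 2.0

rng = np.random.default_rng(1)
x = rng.standard_normal((50, 100))                        # n = 50 < m = 100
print(np.count_nonzero(gaussian_score_matching(x, lam=0.3)))
```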

Generalized Score Matching for Truncated GGMs
Next, we provide theory for the regularized generalized $\mathbf{h}$-score matching estimator $\hat{\boldsymbol{\Psi}}$ in the special case of truncated GGMs.
Theorem 8. Suppose the data matrix $\mathbf{x}$ holds $n$ i.i.d. copies of $\mathbf{X} \sim \mathrm{TN}(\mathbf{0}, \mathbf{K}_0)$. Suppose that the $\boldsymbol{\Gamma}_{0, S_0 S_0}$ block of $\boldsymbol{\Gamma}_0$ is invertible and $\boldsymbol{\Gamma}_0$ satisfies the irrepresentability condition (26) with $\alpha \in (0, 1]$ and true edge set $S_0$. Define $c_X \equiv 2 \max_j \big( 2 (\mathbf{K}_0^{-1})_{jj} + \sqrt{e}\, \mathbb{E}_0 X_j \big)$. If, for $\tau > 3$, the sample size and the regularization parameter are sufficiently large relative to $c_X$, $c_{\boldsymbol{\Gamma}_0}$, $d_{\mathbf{K}_0}$, $\alpha$, $\tau$, and $\log m$, then the following statements hold with probability at least $1 - m^{3-\tau}$: (a) The regularized generalized $\mathbf{h}$-score matching estimator $\hat{\mathbf{K}}$ that minimizes (24) is unique, has its support included in the true support, $\hat{S} \equiv S(\hat{\mathbf{K}}) \subseteq S_0$, and satisfies an entry-wise error bound proportional to $c_{\boldsymbol{\Gamma}_0} \lambda$.

The theorem is proved in Appendix A, where details on the dependencies on constants are provided. A key ingredient of the proof is a tail bound on $\|\boldsymbol{\Gamma}_\gamma - \boldsymbol{\Gamma}_0\|_\infty$, which features products of the $X_j^{(i)}$'s. In Lin et al. [2016], the products are up to fourth order. Using a bounded $\mathbf{h}$, our products automatically calibrate to a quadratic polynomial when the observed values are large, and resort to higher moments only when they are small. This leads to improved bounds and convergence rates, underscored in the new requirement on the sample size $n$, which should be compared to Lin et al. [2016]. For the non-centered case, the proof given for Theorem 8 goes through again, and we have the following consistency results.
Theorem 9. Suppose the data matrix holds $n$ i.i.d. copies of $\mathbf{X} \sim \mathrm{TN}(\boldsymbol{\mu}_0, \mathbf{K}_0)$. Assume that $\mathbf{h} \in \mathcal{H}_{1,1}$ and that $0 \leq h_j \leq M$ and $0 \leq h_j' \leq M'$ a.s. for constants $M, M'$. Let $\gamma$ be a vector of amplifiers that are non-zero only for the diagonal entries of the matrices $\boldsymbol{\Gamma}_{11,j}$. Suppose further that $\boldsymbol{\Gamma}_{0, S_0 S_0}$ is invertible and satisfies the irrepresentability condition (26) with incoherence parameter $\alpha \in (0, 1]$. Suppose, for $\tau > 3$, the sample size and the regularization parameter are sufficiently large relative to $c_X$, $c_{\boldsymbol{\Gamma}_0, \boldsymbol{\Psi}_0}$, $d_{\boldsymbol{\Psi}_0}$, $\alpha$, $\tau$, and $\log m$, where $c_{\boldsymbol{\Gamma}_0, \boldsymbol{\Psi}_0}$ is $c_{\boldsymbol{\Gamma}_0}$ as in (25), with the notation $\boldsymbol{\Psi}_0$ differentiating it from the centered case. Then the following statements hold with probability at least $1 - m^{3-\tau}$: (a) The regularized generalized $\mathbf{h}$-score matching estimator $\hat{\boldsymbol{\Psi}}$ that minimizes (24) is unique, has its support included in the true support, $\hat{S} \equiv S(\hat{\boldsymbol{\Psi}}) \subseteq S_0$, and satisfies an entry-wise error bound proportional to $c_{\boldsymbol{\Gamma}_0, \boldsymbol{\Psi}_0} \lambda$. (b) If, in addition, the minimum nonzero entries of $\boldsymbol{\Psi}_0$ are large enough, then $\hat{S} = S_0$ and $\mathrm{sign}(\hat{\kappa}_{jk}) = \mathrm{sign}(\kappa_{0,jk})$ for all $(j, k) \in S_0$, and $\mathrm{sign}(\hat{\eta}_j) = \mathrm{sign}(\eta_{0,j})$ for $(m + 1, j) \in S_0$.
Remark 2. The quantity $c_X$ in Theorem 9 depends on $\mathbb{E}_0 X_j$, which in turn depends on the structure of both $\boldsymbol{\mu}_0$ and $\mathbf{K}_0$. If $\mu_{0,j}$ is large compared to $(\mathbf{K}_0^{-1})_{jj}$, then $c_X$ seems to scale as $\boldsymbol{\mu}_0$, which negatively impacts the guarantees stated in Theorem 9. However, as in the one-dimensional case for estimation of $\mu_0$ (Example 1), our estimator should automatically adapt to the large mean parameter. This suggests that it might be possible to improve our analysis involving $c_X$.

Numerical Experiments
In this section, we compare the performance of our estimator with different choices of $\mathbf{h}$ to the existing approaches for pairwise interaction power models. In our simulation experiments, we consider $m = 100$ variables and $n = 80$ and $n = 1000$ samples, corresponding to high- and low-dimensional settings. We also tried intermediate sample sizes between these two extremes, but found no interesting results worth reporting. For $n = 80$, amplification is necessary. Except in Section 7.2.2, the amplifier is set based on Theorem 8 to $\delta = C(n, m) = 1.8647$ for truncated GGMs. The same amplifier is also used for settings with other $a$ and $b$. For $n = 1000$, we consider $\delta = 1$, i.e., no amplification, and $\delta = C(n, m) = 1.6438$ (again, based on Theorem 8).

Structure of K
The underlying interaction matrices are selected as follows. Proceeding as in Section 4.2 of Lin et al. [2016], the graph is chosen to have 10 disconnected subgraphs, each containing $m/10$ nodes. Thus, $\mathbf{K}_0$ is block-diagonal. In each block, each lower-triangular element is set to 0 with probability $1 - \pi$ for some $\pi \in (0, 1)$, and is otherwise drawn from $\mathrm{Uniform}[0.5, 1]$. The upper-triangular elements are determined by symmetry. The diagonal elements of $\mathbf{K}_0$ are chosen as a common positive value such that the minimum eigenvalue of $\mathbf{K}_0$ is 0.1. A sketch of this construction is given below.
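A minimal numpy sketch of the construction just described (function name and seed are our own choices):

```python
import numpy as np

def make_block_precision(m=100, n_blocks=10, pi=0.2, seed=0):
    """Block-diagonal K_0: within each block, lower-triangular entries are
    nonzero with probability pi, drawn from Uniform[0.5, 1]; a common
    diagonal value then makes the minimum eigenvalue equal to 0.1."""
    rng = np.random.default_rng(seed)
    size = m // n_blocks
    K = np.zeros((m, m))
    for blk in range(n_blocks):
        s = slice(blk * size, (blk + 1) * size)
        block = np.zeros((size, size))
        rows, cols = np.tril_indices(size, k=-1)
        keep = rng.random(rows.size) < pi
        block[rows, cols] = rng.uniform(0.5, 1.0, rows.size) * keep
        K[s, s] = block + block.T
    return K + (0.1 - np.linalg.eigvalsh(K).min()) * np.eye(m)

K0 = make_block_precision()
print(np.linalg.eigvalsh(K0).min())   # ~0.1
```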
We generate 5 different true precision matrices $\mathbf{K}_0$, and run 10 trials with each of these precision matrices. For $n = 1000$, we choose $\pi = 0.8$, which is in accordance with Lin et al. [2016]. For $n = 80$, we set $\pi = 0.2$. This way $n / (d_{\mathbf{K}_0}^2 \log m)$ is roughly constant; recall Theorems 8 and 9 for truncated GGMs.
In Appendix C, we present results on Erdös-Rényi graphs. Other constructions gave similar conclusions and are hence not included.

Choice of h
Our estimator requires choosing a function $\mathbf{h}: \mathbb{R}_+^m \to \mathbb{R}_+^m$. For simplicity, we will always specify $\mathbf{h}(\mathbf{x}) = (h(x_1), \ldots, h(x_m))^\top$ for a single non-decreasing univariate function $h: \mathbb{R}_+ \to \mathbb{R}_+$, i.e., all coordinates share the same $h$ function.
As previously explained, $\mathbf{h} \in \mathcal{H}_{a,b}$ is a sufficient condition for Assumptions (A1)-(A2), as well as for (C1)-(C2) in the case of unregularized estimators. Only in the proofs of our theoretical guarantees in Section 6 for truncated GGMs did we require $\mathbf{h}$ to be bounded and to have bounded derivatives. As motivated by the discussion in Section 5.3.2, we consider truncated and untruncated powers, $\min(x, c)$ and $x$ (since $2 - a = 2 - b = 1$); we evaluate this choice by contrasting them with the powers $x^{1.5}$ and $x^2$. We also explore functions like $\log(1 + x)$ that seem natural and are linear near $x = 0^+$. In particular, we make a further comparison to functions that are linear near 0 and also have a finite asymptote but are differentiable everywhere: MCP-like [Zhang, 2010] and SCAD-like [Fan and Li, 2001] functions, whose shapes are sketched below. The results reported here are based on selections of best performing choices of $h$.
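Since the paper's exact definitions of these functions are not reproduced here, the following sketches plausible shapes derived from the MCP and SCAD penalties at unit regularization level; the parameter values and these particular forms are our illustrative assumptions.

```python
import numpy as np

def h_mcp(x, gamma=2.0):
    """MCP-like shape: linear near 0, smoothly flattening to gamma/2 beyond gamma."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= gamma, x - x ** 2 / (2.0 * gamma), gamma / 2.0)

def h_scad(x, a=3.7):
    """SCAD-like shape: linear on [0, 1], quadratic blend on (1, a],
    constant (a + 1)/2 beyond a."""
    x = np.asarray(x, dtype=float)
    mid = (2.0 * a * x - x ** 2 - 1.0) / (2.0 * (a - 1.0))
    return np.where(x <= 1.0, x, np.where(x <= a, mid, (a + 1.0) / 2.0))

print(h_mcp(np.array([0.1, 1.0, 5.0])), h_scad(np.array([0.1, 1.0, 5.0])))
```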
We do not observe any clear relationship between features such as convexity or the slope of $h$ at the origin and the performance of the estimator. Nonetheless, for many choices of rather simple functions $h$, our estimator provides a significant improvement over existing methods. In particular, most $h$ functions that behave linearly for small $x$, namely $\log(1 + x)$ and $x$ and their truncations, and additionally MCP and SCAD, always perform better than $x^{1.5}$ and $x^2$. This agrees with our discussion in Section 5.3.2, where $2 - a = 1$ is a reasonable choice of the power for small $x$; also see Section 7.3. However, we conclude that there is no real gain from making the function smoother by using MCP or SCAD.
Truncated Centered GGMs For data from a truncated centered Gaussian distribution, we compare our generalized score matching estimator, with various choices of $\mathbf{h}$, to SpaCE JAM (SJ, Voorman et al., 2014), which estimates graphs using additive models for conditional means, the pseudo-likelihood method SPACE [Peng et al., 2009] in the reformulation of Khare et al. [2015], the graphical lasso (GLASSO, Yuan and Lin, 2007, Friedman et al., 2008), the neighborhood selection estimator (NS) of Meinshausen and Bühlmann [2006], and nonparanormal SKEPTIC [Liu et al., 2012] with Kendall's $\tau$. Recall that the choice of $h(x) = x^2$ corresponds to the estimator from Lin et al. [2016]. The ROC (receiver operating characteristic) curves for different estimators are shown in Figure 3. Each plotted curve corresponds to the average of 50 ROC curves, where the averaging is the vertical averaging from Algorithm 3 in Fawcett [2006], and is mean AUC-preserving. To reduce clutter, we only report the results for the top performing competing methods. In particular, results for nonparanormal SKEPTIC are omitted, as the method always performs the worst in our experiments. The corresponding means and standard deviations of the AUCs (areas under the curves) over 50 curves are given in Table 1.
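For reference, vertical averaging in the sense of Fawcett [2006] can be sketched as follows; this is a generic implementation, not the authors' code, and the grid size is an arbitrary choice.

```python
import numpy as np

def vertical_average_rocs(rocs, n_grid=201):
    """Vertically average ROC curves (cf. Fawcett 2006, Algorithm 3):
    interpolate each curve's TPR on a common FPR grid and average pointwise.
    rocs: list of (fpr, tpr) pairs, each sorted by increasing fpr."""
    grid = np.linspace(0.0, 1.0, n_grid)
    tprs = [np.interp(grid, fpr, tpr) for fpr, tpr in rocs]
    return grid, np.mean(tprs, axis=0)

# Example with two toy piecewise-linear curves:
roc1 = (np.array([0.0, 0.2, 1.0]), np.array([0.0, 0.7, 1.0]))
roc2 = (np.array([0.0, 0.5, 1.0]), np.array([0.0, 0.9, 1.0]))
grid, mean_tpr = vertical_average_rocs([roc1, roc2])
print(np.trapz(mean_tpr, grid))   # mean AUC of the two curves
```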
Looking at the mean AUCs, with the standard deviations in mind, all choices of $h$ considered here perform better than $h(x) = x^2$ from Hyvärinen [2007] and Lin et al. [2016] and the competing methods. The results for $n = 1000$ in Table 1 also show that the multiplier does help improve the AUCs, a matter to be discussed in Section 7.2.2.

Truncated Non-Centered GGMs Next, we generate data from a truncated non-centered Gaussian distribution with both parameters $\boldsymbol{\mu}$ and $\mathbf{K}$ unknown. In each trial, we form the true $\mathbf{K}_0$ as in Section 7.1, and generate each component of $\boldsymbol{\mu}_0$ independently from the normal distribution with mean 0 and standard deviation 0.5.
We compare the performance of our profiled estimator based on (19), with different $h$ functions but with no penalty on $\boldsymbol{\eta} \equiv \mathbf{K}\boldsymbol{\mu}$, to SPACE, SpaCE JAM (SJ), GLASSO, and neighborhood selection (NS). As before, we consider 50 trials. Representative ROC curves are plotted in Figure 4, and the corresponding AUCs are summarized in Table 2. Even without tuning the extra penalty parameter on $\boldsymbol{\eta} \equiv \mathbf{K}\boldsymbol{\mu}$, our profiled estimator beats the competing methods by a large margin when $n = 80$. With multiplier 1 and $n = 1000$, our estimators still do better than SpaCE JAM and GLASSO, and have performance comparable to the other competing methods. It might appear that the performance of our estimators deteriorates with a multiplier larger than 1; however, as we will see, there can be a significant improvement in AUCs if we tune an additional parameter for the multiplier. As in the centered case, the leading $h$ functions in each category perform similarly, and the exact choice is not crucial. Subsequently, we will simply use $h(x) = \min(x, 3)$.

Choice of multiplier
Truncated Centered GGMs In Figure 5, the ROC curves for GLASSO, SPACE, and our estimator with $h(x) = \min(x, 3)$, but with different levels of amplification via different choices of the multiplier $\delta$, are compared for the centered case of Section 7.2.1.
While Theorem 8 guarantees consistency only for $\delta < C(n, m)$, we observe that there can be a gain from going beyond the upper-bound multiplier $C(n, m)$, which is 1.8647 for $n = 80$ and 1.6438 for $n = 1000$ (when $n = 1000$, $C(n, m)$ turns out to be the best-performing multiplier). However, the effect deteriorates fast as the multiplier grows larger. The figure suggests that while some additional gains are possible by tuning over the choice of multiplier, the upper-bound multiplier is a good default.

Truncated Non-Centered GGMs In Figure 6, we consider the non-centered case of Section 7.2.1, and use the non-profiled estimator; that is, the non-centered estimator with an $\ell_1$ penalty on both $\mathbf{K}$ and $\boldsymbol{\eta} \equiv \mathbf{K}\boldsymbol{\mu}$. The ROC curves are compared to those of the competing methods GLASSO and SPACE. For the choice of amplification in our estimator, we consider the upper-bound multiplier $C(n, m)$ from Theorem 9 as the default. We refer to this as high amplification. We also consider a lower amount of amplification, with $\delta = 2 - (1 + 24e \log m / n)^{-1}$, referred to as medium. For $n = 1000$, we also consider a low multiplier of 1, which corresponds to no amplification. We compare these possible defaults while varying the ratio between the penalty parameters on $\mathbf{K}$ and $\boldsymbol{\eta}$, of which we show some representatives in the plots.
We see that among our defaults, the upper-bound choice $C(n, m)$ performs best. Some additional gains are possible by tuning the multiplier over a grid of values containing this choice. Moreover, we see that it can be beneficial to tune over both $\lambda_{\mathbf{K}}$ and $\lambda_{\mathbf{K}}/\lambda_{\boldsymbol{\eta}}$.
We remark that while, for each run, the best model picked by the BIC falls on the ROC curve, a few squares are off the curve in panel (c) of Figure 6. This is because these squares correspond to the average of the true and false positive rates of the chosen BIC models over all 50 runs, and the distribution of the chosen models can be multimodal. Nonetheless, in all cases, the average of the models picked by the BIC tuned over both $\lambda_{\mathbf{K}}$ and $\lambda_{\mathbf{K}}/\lambda_{\boldsymbol{\eta}}$ looks reasonable.

Figure 6: Performance of the non-centered estimator with $h(x) = \min(x, 3)$. Each curve corresponds to a different choice of $\lambda_{\mathbf{K}}/\lambda_{\boldsymbol{\eta}}$. Squares indicate models picked by eBIC with refit. The square with black outline has the highest eBIC among all models (combinations of $\lambda_{\mathbf{K}}$, $\lambda_{\boldsymbol{\eta}}$). Multipliers correspond to medium or high for $n = 80$, and low, medium or high for $n = 1000$, respectively.

Other a/b Models
We now turn to the non-Gaussian setting ($a \neq 1$ or $b \neq 1$). Based on the observations in Section 5.3.2, we focus on functions of the type $\min(x^p, c)$ for some power $p > 0$ and truncation point $c > 0$. For simplicity, for the non-centered models we use the profiled estimator (19) (i.e., $\lambda_{\boldsymbol{\eta}} = 0$) and use the multiplier $C(n, m)$ from Theorem 8 for truncated GGMs as guidance. We note that tuning over the $\lambda_{\boldsymbol{\eta}}$ parameter and the multiplier can potentially give a significant improvement, as seen in Section 7.2.
These simulations suggest that, among the class of functions of the form $\min(x^p, c)$, $x^{2-a}$ or $\min(x^{2-a}, c)$ with a moderately large $c$ can be used as the default choice of $h(x)$. This agrees with our findings in Section 7.2.1. We note that bounded $h$ functions were only used in the proofs for truncated GGMs, and picking a moderately large truncation point can correspond to having an untruncated power.

Exponential Setting
For the exponential models, $a = b = 1/2$. Since $a = b$, for both centered and non-centered settings, based on the principle in Section 5.3.2, choosing $h(x) = \min(x^{3/2}, c)$ satisfies (A1) and (A2) and also ensures that the entries in $\boldsymbol{\Gamma}$ and $\mathbf{g}$ are bounded (for small $x$), while choosing $h(x) = \min(\sqrt{x}, c)$ only guarantees (A1) and (A2).
In Figure 7, we present the AUCs for the ROC curves of edge recovery with different choices of $h(x) = \min(x^{\mathrm{pow}}, c)$. As before, we set $n = 80$ or 1000 and $m = 100$, but we use an $\boldsymbol{\eta}_0$ with all components equal to $-0.5$, 0, or 0.5; for $\boldsymbol{\eta}_0 \equiv \mathbf{0}$, we assume this information is known and use the centered estimator. The results suggest that $\mathrm{pow} = 3/2 = 2 - a$ is the best choice of power. For this optimal choice, the performance improves with larger $c$, so $x^{2-a}$ gives the best results. For sub-optimal powers, including truncation gives better results.
Gamma Setting
For the gamma models, $a = 1/2$ and $b = 0$. The results are shown in Figure 8, where we consider $n = 80$, 1000, and $\boldsymbol{\eta} = \pm 0.5 \cdot \mathbf{1}_{100}$. They suggest that $\mathrm{pow} = 2 - a = 1.5$ works consistently well, although it is slightly outperformed by 1 and 1.25 in one case. As in the exponential case, with the optimal power it is beneficial to choose a large truncation point, or to work with the untruncated power $x^{1.5}$. We conclude that the performance is likely only dependent on the $(2 - a)$ power requirement for the $(\mathbf{x}^a)^\top \mathbf{K} \mathbf{x}^a$ part, rather than on $2 - \min_j \eta_{0,j}$; simulations in the next section rule out the possibility of the latter.

Other Choices of a and b
In this section, we consider other choices of $a$ and $b$: specifically, $a = 3/2$ and $b = 1/2$ or $b = 0$. These combinations are chosen to confirm, in a more extreme setting, that the performance is mainly determined by the requirements on the power based on $a$, which correspond to choosing a power of $1 - a$ or $2 - a$, and not by those based on $b$, or on $\boldsymbol{\eta}$ when $b = 0$, which would correspond to $1 - b$ and $2 - b$. The relationship between these two settings is analogous to that between the exponential and gamma models (same $a$, with $b$ nonzero/zero).
The results are shown in Figures 9 and 10, and indeed confirm that $x^{2-a} = x^{0.5}$ consistently gives the optimal results, even though the $\boldsymbol{\eta}^\top \mathbf{x}^b$ term is in favor of $x^{2-b} = x^{1.5}$ for $b = 0.5$, and the $\boldsymbol{\eta}^\top \log \mathbf{x}$ term is in favor of $x^2$, or at least $x^{1 - \min_j \eta_{0,j}}$, when $b = 0$. There are two possible explanations for the optimality of $2 - a$ over $\max\{2 - a, 2 - b\}$ or $\max\{2 - a, 1 - \min_j \eta_{0,j}\}$: (1) the AUC metric is measured only on our quantity of interest, edge recovery for the interaction matrix, which only depends on $\mathbf{x}^a$; (2) using the profiled estimator weakens the effect of $b$.

Discussion
In this paper, we proposed a generalization of the score matching estimator of Hyvärinen [2007], based on scaling the log-gradients to be matched with a suitably chosen function h. The generalization retains the advantages of Hyvärinen's method: Estimates can be computed without knowledge of normalizing constants, and for canonical parameters of exponential families, the estimation loss is a quadratic function.
For high-dimensional exponential family graphical models, following Lin et al. [2016], we add an $\ell_1$ penalty to regularize the generalized score matching loss. One practical issue that is overlooked in Lin et al. [2016] is the fact that the score matching loss can be unbounded below for a small tuning parameter when the number of parameters exceeds the sample size $n$. We fix this issue by amplifying the diagonal entries of the quadratic matrix in the definition of the generalized score matching loss by a factor/multiplier, and we give an upper bound on that multiplier that guarantees consistency. As examples we consider pairwise interaction power models on the non-negative orthant $\mathbb{R}_+^m$. Specifically, the considered models are exponential families in which the log-density is the sum of pairwise interactions between entries of the powers $\mathbf{x}^a$, plus linearly weighted effects $\mathbf{x}^b$, or $\log(\mathbf{x})$ when $b = 0$. Our main interest is in the matrix of interaction parameters, whose support determines the distribution's conditional independence graph. The considered framework covers truncated normal distributions ($a = b = 1$), exponential square root graphical models ($a = b = 1/2$) from Inouye et al. [2016], as well as a class of multivariate gamma distributions ($a = 1/2$, $b = 0$).
In the case of multivariate truncated normal distributions, where the conditional independence graph is given by the underlying Gaussian inverse covariance matrix, the sample size required for the consistency of our method using bounded $\mathbf{h}$ is $\Omega(d^2 \log m)$, where $d$ is the degree of the graph. This matches the rates for Gaussian graphical models in Ravikumar et al. [2011] and Lin et al. [2016]. In contrast, the sample complexity for truncated Gaussian models given in Lin et al. [2016] is $\Omega(d^2 \log^8 m)$.
For the considered class of pairwise interaction models, we recommend using the function $\mathbf{h}$ with coordinates $h_j(x) = \min(x^{2-a}, c)$ for some moderately large $c$, or simply $h_j(x) = x^{2-a}$. While this choice is effective, it would be an interesting problem for future work to develop a method that adaptively chooses an optimized function $\mathbf{h}$ from the data.

A Proofs
A.1 Proof of Theorem 1

The following integration by parts lemma is used in the proof of Theorem 1.
Lemma 1. Let $f, g: \mathbb{R}_+ \to \mathbb{R}$ be functions that are absolutely continuous in every bounded subinterval of $\mathbb{R}_+$. Then
$$\int_0^{+\infty} f(x)\, g'(x)\, dx = \lim_{x \to +\infty} f(x) g(x) - \lim_{x \to 0^+} f(x) g(x) - \int_0^{+\infty} f'(x)\, g(x)\, dx,$$
provided that the limits and integrals above exist and are finite.

Proof. This is an analog of Lemma 4 from Hyvärinen [2005] that can be proved by integrating the derivatives, and it follows from the fundamental theorem of calculus for absolutely continuous functions and the product rule. In particular, we work with integrals on bounded intervals $[0, c]$, where the product of two absolutely continuous functions on a bounded interval is again absolutely continuous, and the result is then obtained by letting $c \to +\infty$.
Proof of Theorem 1. Recall the following assumptions from Section 2.2:

(A1) $p_0(\mathbf{x})\, h_j(x_j)\, \partial_j \log p(\mathbf{x}) \to 0$ as $x_j \searrow 0^+$ and as $x_j \nearrow +\infty$, for all $\mathbf{x}_{-j} \in \mathbb{R}_+^{m-1}$ and all $p \in \mathcal{P}_+$;

(A2) $\mathbb{E}_0 \big\| \nabla \log p(\mathbf{X}) \circ \mathbf{h}(\mathbf{X})^{1/2} \big\|_2^2 < +\infty$ and $\mathbb{E}_0 \big\| \big( \nabla \log p(\mathbf{X}) \circ \mathbf{h}(\mathbf{X}) \big)' \big\|_1 < +\infty$, $\forall p \in \mathcal{P}_+$.

Without explicitly writing the domains $\mathbb{R}_+$ or $\mathbb{R}_+^m$ in all integrals, by (6) we can expand the squared norm and split the loss into three parts $A$, $B$, and $C$, where $A$ will simply appear in the final display as is, $C$ is a constant as it only involves the true pdf $p_0$, and we wish to simplify the cross term $B$ by integration by parts. We can split the integral into these three parts since $A$ and $C$ are assumed finite in the first part of (A2), and the integrand in $B$ is integrable since $|2ab| \leq a^2 + b^2$. Thus, by linearity and Fubini's theorem, we can write $B$ as a sum of iterated univariate integrals, and using $\partial_j \log p_0(\mathbf{x}) = \partial_j p_0(\mathbf{x})/p_0(\mathbf{x})$ these can be simplified. Since we assume $p_0$ and $p$ to be twice continuously differentiable, for every $j = 1, \ldots, m$ and fixed $\mathbf{x}_{-j} \in \mathbb{R}_+^{m-1}$, $p_0(\mathbf{x}_{-j}; x_j)$ is an absolutely continuous function of $x_j$ on every bounded sub-interval of $\mathbb{R}_+$, and $\partial_j \log p(\mathbf{x}_{-j}, x_j) = \partial_j p(\mathbf{x}_{-j}, x_j)/p(\mathbf{x}_{-j}, x_j)$ is a continuously differentiable (and hence absolutely continuous) function of $x_j$ by the quotient rule. Thus $h_j(x_j)\, \partial_j \log p(\mathbf{x}_{-j}; x_j)$ is also absolutely continuous by the absolute continuity assumption on $h_j$. Then, by Lemma 1, applied with $f \equiv p_0(\mathbf{x}_{-j}; x_j)$ and $g \equiv h_j(x_j)\, \partial_j \log p(\mathbf{x}_{-j}; x_j)$ as functions of $x_j$, followed by assumption (A1), the boundary terms vanish. Justified by the second half of (A2), by Fubini-Tonelli and linearity again, we arrive at
$$J_{\mathbf{h}}(P) = \int_{\mathbb{R}_+^m} p_0(\mathbf{x}) \sum_{j=1}^m \left[ h_j'(x_j)\, \partial_j \log p(\mathbf{x}) + h_j(x_j)\, \partial_{jj} \log p(\mathbf{x}) + \frac{1}{2} h_j(x_j) \big( \partial_j \log p(\mathbf{x}) \big)^2 \right] d\mathbf{x} + C,$$
where $C$ is a constant that does not depend on $p$.

A.2 Theorems and Examples in Section 3
Proof of Theorem 2. For exponential families and under the assumptions, the empirical loss $\hat{J}_{\mathbf{h}}(p_\theta)$ in (8) becomes, up to an additive constant, a quadratic function of $\boldsymbol{\theta}$. Then we can write $\hat{J}_{\mathbf{h}}(p_\theta) = \frac{1}{2} \boldsymbol{\theta}^\top \boldsymbol{\Gamma}(\mathbf{x})\, \boldsymbol{\theta} - \mathbf{g}(\mathbf{x})^\top \boldsymbol{\theta} + \mathrm{const}$.
By integration by parts (suppressing the dependence of $p_{\mu_0}$ on $\mu_0$), where the last step follows from the assumed boundary limits, one obtains the asymptotic variance of $\hat{\mu}_h$.
The Cramér-Rao lower bound follows from taking the second derivative of log p µ 0 with respect to µ 0 .
The Cramér-Rao lower bound follows from taking the second derivative of log p σ 2 0 with respect to σ 2 0 .

A.3 Theorems in Section 5
Proof of Theorem 4. Case $b > 0$: We use a strategy similar to that of Inouye et al. [2016]. Let $V_1 = \{\mathbf{v} \in \mathbb{R}_+^m : \|\mathbf{v}\|_1 = 1\}$. Then by Fubini-Tonelli the normalizing constant can be written as an iterated integral over directions $\mathbf{v} \in V_1$ and scales $z \in (0, \infty)$. Here $V_1$ is compact and the inner integral, if finite, is continuous in $\mathbf{v}$. It thus suffices to show that the inner integral is finite at every single $\mathbf{v} \in V_1$. Fixing $\mathbf{v} \in V_1$, the inner integral takes the form $N(A, B, a, b) \equiv \int_0^\infty \exp(-A z^{2a} + B z^b)\, dz$ for constants $A$ and $B$ depending on $\mathbf{v}$, $\mathbf{K}$, and $\boldsymbol{\eta}$. Recall that (CC1) states $\mathbf{v}^\top \mathbf{K} \mathbf{v} > 0$ for all $\mathbf{v} \in \mathbb{R}_+^m \setminus \{\mathbf{0}\}$, so $A > 0$.
(i) If $B \leq 0$, then $N(A, B, a, b) \leq \int_0^\infty \exp(-A z^{2a})\, dz < \infty$ for $a > 0$.

(ii) Suppose $B > 0$. We first want to bound $\exp(-A z^{2a} + B z^b) \leq N_0 \exp(-A z^{2a}/2)$ for some finite constant $N_0 > 0$, so that $N(A, B, a, b) \leq N_0 \int_0^\infty \exp(-A z^{2a}/2)\, dz$, a finite constant for $a > 0$. Thus, it remains to give conditions so that $\exp(-A z^{2a}/2 + B z^b)$ is bounded by some finite constant $N_0$, which by continuity only requires a finite limit as $z \searrow 0$ and as $z \to +\infty$. As $z \to +\infty$, $B z^b \to +\infty$, while $-A z^{2a}/2 \to -\infty$; we thus need $b < 2a$ so that the sum of the two does not go to positive infinity. On the other hand, as $z \searrow 0$, $-A z^{2a}/2 \to 0$, so we need $b > 0$, since otherwise $B z^b \to +\infty$. In conclusion, we require $2a > b > 0$.
It thus suffices to require (CC1) and (CC2) $2a > b > 0$ to eliminate restrictions on $B$, and hence on $\boldsymbol{\eta}$. That is, $\boldsymbol{\eta}$ can take values in the entirety of $\mathbb{R}^m$.
Case $b = 0$: In this case, a change of variables reduces the inner integral to a gamma-type integral; the computation requires $a > 0$. Assuming $a > 0$, the resulting quantity is finite if and only if $\eta > -\mathbf{1}_m$ entrywise, by the definition of the gamma function.
The centered setting, in which the term involving $x^b$ is excluded, can be viewed as a special case of both (1) and (2) with $\eta \equiv 0$; thus (CC1) and $a > 0$ are sufficient.
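The role of (CC2) can be probed numerically: with $A > 0$ and $2a > b > 0$, the inner integral remains finite even when $B > 0$. The values of $A$, $B$, $a$, $b$ below are arbitrary illustrative choices.

```python
# Numerical illustration of (CC2): the inner integral
#   N(A, B, a, b) = ∫_0^∞ exp(-A z^{2a} + B z^b) dz
# is finite for A > 0 when 2a > b > 0, even though B > 0.
import numpy as np
from scipy.integrate import quad

def inner_integral(A, B, a, b):
    return quad(lambda z: np.exp(-A * z**(2 * a) + B * z**b), 0, np.inf)[0]

print(inner_integral(A=1.0, B=2.0, a=1.0, b=1.0))  # 2a = 2 > b = 1: finite
# For b >= 2a the integrand no longer decays and the integral diverges.
```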
Proof of Theorem 5. Recall assumptions (A1) and (A2) from the proof of Theorem 1. Let $K_0$ and $\eta_0$ be the true parameters, so that $p_0 \in \mathcal{P}_+$, with $\mathcal{P}_+$ corresponding to a parameter space in which all parameters satisfy the conditions for a finite normalizing constant. We now give sufficient conditions on $h$ for (A1) and (A2) to hold.
Conditions for (A1): Fix $j \in \{1, \ldots, m\}$ and $x_{-j} \in \mathbb{R}_+^{m-1}$, and consider the product $h_j(x_j)\,p_0(x)\,\partial_j \log p(x)$, whose explicit form is given in (37). (1) Let $x_j \to +\infty$. If $b > 0$, then since $2a > b > 0$ and $B_1 < 0$, the exponential term in (37) decreases to 0 exponentially fast and its reciprocal dominates any polynomial function. Thus, the entire product goes to 0 provided $h_j(x_j)$ grows no faster than polynomially as $x_j \to +\infty$. If $b = 0$, the $C_1 \log x_j$ term is again dominated by $B_1 x_j^{2a}/(2a)$, and the same conclusion holds.
(2) Now let $x_j \searrow 0$.
(i) Let $b > 0$. Then the exponential term in (37) goes to 1, and we only need the remaining factor, given in (38), to vanish in the limit.
• If $a > 1$ and $b > 1$, the second term in (38) converges to 0 as $x_j \searrow 0$, so (38) holds for any $h_j$ that is bounded near 0; in particular, it is satisfied by any $h_j$ that has a finite right limit at 0.
• If $a = 1$ and $b \ge 1$, or $a \ge 1$ and $b = 1$, then the second term in (38) is a polynomial of non-negative powers plus a potentially nonzero constant. A necessary and sufficient condition for (38) is thus $\lim_{x_j \searrow 0} h_j(x_j) = 0$.
• If $a < 1$ or $b < 1$, then the second term in (38) is a polynomial having terms with (possibly negative) degree at least $\min\{a-1, b-1\}$. To counteract this, a necessary and sufficient condition is $\lim_{x_j \searrow 0} h_j(x_j)/x_j^{1-\min\{a,b\}} = 0$.
In conclusion, for $b > 0$ it suffices that $h_j(x_j) = o\big(x_j^{1-\min\{a,b\}}\big)$ as $x_j \searrow 0$ when $\min\{a,b\} \le 1$, and that $h_j$ has a finite right limit at 0 when $\min\{a,b\} > 1$.
(ii) Now assume $b = 0$. With the $C_1 \log x_j$ term dominating, the exponential part of (37) scales as $x_j^{C_1}$. We thus require the corresponding limit to vanish, which, by the previous discussion on (38), again reduces to a condition of the same form on $h_j$.
In summary, (A1) is satisfied if $h_j(x_j)$ grows at most polynomially as $x_j \to +\infty$, and $h_j(x_j) = o\big(x_j^{1-\min\{a,b\}}\big)$ as $x_j \searrow 0$.
Conditions for (A2): For (A2), we consider powers of $x$ as the functions $h_j$ for simplicity; conclusions for other functions that have the same tail behavior (big-$O$ scaling) as $x \searrow 0$ and $x \to +\infty$ follow similarly. Sufficiency results for piecewise power functions follow by partitioning, and similarly for other functions $h$ whose function values and derivatives can be bounded by those of some piecewise power function (e.g. truncated powers), since (A2) concerns integrability of products involving positive powers of $h$ and $h'$.
Let $K_0$ and $\eta_0$ be the true parameters, lying in the parameter space that guarantees a finite normalizing constant. By part (2) of the proof of Theorem 4, the assumption that $K_0$ satisfies (CC1) implies that $\min_{v \in \mathbb{R}_+^m,\, \|v\|_2 = 1} v^\top K_0 v > 0$. We then have a decomposition of the integrand in (A2) into finitely many terms. For any other $K$ and $\eta$ in the parameter space, for the first part of (A2) it suffices to show for each $j = 1, \ldots, m$ that $D < \infty$, where $D$ denotes the corresponding integral. Plugging the decomposition back into the definition of $D$, we can split $D$ into a sum of six terms $D_1$ through $D_6$, each of which is a sum of products of univariate integrals of powers $x_i^{\mathrm{pow}_i}$ against exponential factors, times a constant involving $K$ and $\eta_j$, where $\mathrm{pow}_i \ge 0$ for $i \neq j$. We have thus decomposed the integral into a product of univariate integrals. Each factor with $i \neq j$ is finite for all $\mathrm{pow}_i \ge 0$, regardless of whether $b$ is nonzero, since we assumed $K_0$ and $\eta_0$ to lie in the parameter space with a finite normalizing constant. Indeed, if $b > 0$ then the exponent is a polynomial of positive degree with negative leading coefficient; if $b = 0$ then integrability follows from $\eta_{0,i} + \mathrm{pow}_i \ge \eta_{0,i} > -1$. Thus, we only need to consider the univariate integral that involves the $x_j$ terms. We split this integral into two parts, over $[0, 1]$ and $[1, \infty)$, respectively.
• If $b > 0$, then on $[0, 1]$ the exponential part is bounded above and below by positive constants, and since (A1) requires $h_j(x) = o\big(x^{1-\min\{a,b\}}\big)$ as $x \searrow 0$, the integrand is $o\big(x^{\min\{a,b\}-1}\big)$, which is integrable on $[0, 1]$ because $\min\{a,b\} > 0$. The integrand on $[1, \infty)$ is integrable since (A1) assumes $h$ grows at most polynomially.
We conclude that if the true and the proposed parameters yield densities with finite normalizing constants, and if $h$ satisfies assumption (A1), then (A2) is automatically satisfied.
In the centered case, where we assume $\eta \equiv 0$, we only need $\lim_{x_j \searrow 0} h_j(x_j)/x_j^{1-a} = 0$, as it is a special case with $b = 2a$.
Proof of Theorem 8. The proof of Theorem 6 from Lin et al. [2016] does not rely on the original $\hat{\Gamma}$ being an unbiased estimator of the population $\Gamma_0$, but instead only requires a bound on $\|\hat{\Gamma} - \Gamma_0\|_\infty$. Thus, for $\hat{\Gamma}_\gamma = \hat{\Gamma} + \mathrm{diag}(\gamma)$, by Theorem 6 it suffices to prove that, for any $\tau > 3$, we can bound $\|\hat{\Gamma}(x) + \mathrm{diag}(\gamma(x)) - \Gamma_0\|_\infty$ by some $\epsilon_1$ and $\|\hat{g}(x) - g_0\|_\infty$ by some $\epsilon_2$, uniformly with probability at least $1 - m^{3-\tau}$. Recall from (21) the form of the $(k, \ell)$-th entry of the $j$-th block of $\hat{\Gamma}_\gamma \in \mathbb{R}^{m^2 \times m^2}$, as well as the entry of $\hat{g} \in \mathbb{R}^{m^2}$ (obtained by vectorizing an $m \times m$ matrix) corresponding to $(j, k)$. Denote $M \equiv \max_j \sup_{x > 0} h_j(x)$ and $M' \equiv \max_j \sup_{x > 0} h_j'(x)$, and let $c_X \equiv 2 \max_j \big(2\sqrt{\Sigma_{jj}} + \sqrt{e}\,\mathbb{E}_0 X_j\big)$. Using the results for sub-Gaussian random variables from part 2 of Lemma 4 in Appendix B, we obtain, for any $t_1 > 0$, elementwise tail bounds; the bound for a diagonal entry is just a constant multiple of the $(k, k)$-th entry of $\hat{\Gamma}_j$ itself, with the constant explicitly calculable and a function of $m$ and $n$ only, and analogous bounds hold for $k \neq \ell$. In the non-centered setting, $\hat{\Gamma}$ has a block structure with $m$ blocks, and the $j$-th block of $\hat{\Gamma}_{11} \in \mathbb{R}^{m^2 \times m^2}$ has the $(k, \ell)$-th entry recalled above. On the other hand, $\hat{g} \in \mathbb{R}^{m^2 + m}$ is a rearrangement of $\hat{g}^{(*)} \equiv [\hat{g}_1^\top, \hat{g}_2^\top]^\top$, where the entry of $\hat{g}_1 \in \mathbb{R}^{m^2}$ (obtained by vectorizing an $m \times m$ matrix) corresponding to $(j, k)$ is as recalled above. Noting that the bounds in Lemma 4 also hold when $\mu = 0$, we may then use bounds similar to those in the proof of Theorem 8, together with union bounds, to arrive at analogous consistency results, modulo different constants. The amplifiers $\gamma$ can be incorporated analogously.
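For intuition on the amplifier, the sketch below shows one generic way $\hat{\Gamma}_\gamma = \hat{\Gamma} + \mathrm{diag}(\gamma)$ resolves the inexistence problem: adding $\mathrm{diag}(\gamma)$ makes the quadratic part of the $\ell_1$-penalized loss positive definite, so a minimizer exists even when $\hat{\Gamma}$ is rank-deficient. The ISTA-style loop is a generic $\ell_1$ solver used for illustration only; it is not the algorithm analyzed in Lin et al. [2016], and all values are placeholders.

```python
# Sketch: regularized score matching with a diagonal amplifier gamma,
#   minimize 0.5 θ'(Γ + diag(γ))θ - g'θ + lam * ||θ||_1,
# solved by a generic proximal-gradient (ISTA) loop.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def regularized_score_matching(Gamma, g, gamma, lam, n_iter=2000):
    Gamma_gamma = Gamma + np.diag(gamma)      # amplified matrix Γ_γ
    step = 1.0 / np.linalg.eigvalsh(Gamma_gamma)[-1]  # 1 / Lipschitz constant
    theta = np.zeros_like(g)
    for _ in range(n_iter):
        grad = Gamma_gamma @ theta - g        # gradient of the quadratic part
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 8))
Gamma = A.T @ A / 5      # rank-deficient placeholder: low-sample regime
g = rng.standard_normal(8)
theta_hat = regularized_score_matching(Gamma, g, gamma=0.1 * np.ones(8), lam=0.05)
print(np.round(theta_hat, 3))
```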

B Auxiliary Lemmas and Definitions
In this appendix, to simplify notation, the operator $\mathbb{E}$ denotes expectation under the true distribution when clear from the context, unless otherwise noted. The next lemma verifies (A1)–(A2) in the case of truncated normal distributions.
Lemma 2 (Assumptions for truncated normal). Consider a non-centered truncated normal distribution defined by Equation (20) with unknown positive definite inverse covariance parameter $K_0$ and unknown mean parameter $\mu_0$. Then, assuming $0 \le h_j \le M_j$, $|h_j'| \le M_j'$, and $\lim_{x \searrow 0} h_j(x) = 0$ for all $j$, assumptions (A1)–(A2) for score matching are satisfied for any parameters $K$ and $\mu$. Taking $\mu \equiv \mu_0 \equiv 0$, the assumptions also hold in the centered case. Choosing $m = 1$ gives the univariate case.
Proof of Lemma 2. Consider $p \sim \mathrm{TN}(\mu, K)$, with $k_j$ the $j$-th column of $K$. Let $M \equiv \max_j M_j$ and $M' \equiv \max_j M_j'$.
(A1) For any fixed $x_{-j} \in \mathbb{R}_+^{m-1}$ and any $p \in \mathcal{P}_+$ with parameters $K$ and $\mu$, the product $h_j(x_j)\,p_0(x)\,\partial_j \log p(x)$ is bounded by an expression of the form $h_j(x_j)\exp(-C_1 x_j^2 + C_2 x_j)(C_3 x_j + C_4)$ for some constants $C_1$, $C_2$, $C_3$, and $C_4$ depending on $x_{-j}$, $K_0$, $K$, $\mu_0$ and $\mu$, with $C_1 = \kappa_{0,jj}/2$. Since $\kappa_{0,jj} > 0$ and we assumed $h_j$ to be bounded, the limit as $x_j \to +\infty$ equals 0 for all $j$ and $x_{-j}$. Similarly, the limit as $x_j \searrow 0$ vanishes if and only if we assume $\lim_{x_j \searrow 0} h_j(x_j) = 0$.
(A2) For any $p \in \mathcal{P}_+$ with parameters $K$ and $\mu$, the first part of (A2) is finite, since $M$, $K$, $K_0$, $\mu$, $\mu_0$ are all finite and a truncated normal has finite moments of all orders. The second part is finite for the same reason, using the bound $M'$ on $h_j'$. Hence, (A1) and (A2) are both satisfied.
We remark that the definition of the sub-Gaussian norm here allows the variable to be non-centered, and differs from the one in Vershynin [2012], which uses the moments $(\mathbb{E}|X|^q)^{1/q}$ in the definition. Instead, it coincides with the quantity denoted $\theta_2$ in Buldygin and Kozachenko [2000]. The definition of the sub-Gaussian parameter is the same as in Buldygin and Kozachenko [2000], and the definition of the sub-exponential norm is as in Vershynin [2012].
Lemma 3 (Properties of Sub-Gaussian and Sub-Exponential Variables).
1) For any $X$ and $r = 1, 2$, $\|X - \mathbb{E}X\|_{\psi_r} \le 2\|X\|_{\psi_r}$ and $\|X\|_{\psi_r} \le \|X - \mathbb{E}X\|_{\psi_r} + |\mathbb{E}X|$, as long as the expectation and norms are finite.
7) [Boucheron et al., 2013] If $X_1, \ldots, X_n$ are i.i.d. and there exists some $B > 0$ such that $\sup_{q \ge 2} (\mathbb{E}|X_1|^q/q!)^{1/q} \le B$, then for all $\epsilon > 0$,
\[
\mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}X_1\Big| \ge \epsilon\Big) \le 2\exp\Big(-\frac{n}{2}\min\Big\{\frac{\epsilon^2}{B^2}, \frac{\epsilon}{B}\Big\}\Big).
\]
6) This follows from Corollary 5.17 of Vershynin [2012]. 7) By Theorem 2.10 of Boucheron et al. [2013], wherein we let $v \equiv nB^2/2$ and $c \equiv B/2$, we have
\[
\mathbb{P}\Big(\sum_{i=1}^n (X_i - \mathbb{E}X_i) \ge \sqrt{2vt} + ct\Big) \le e^{-t}
\]
for all $t > 0$. (Theorem 2.10 gives a one-sided bound; the bound for the other side is obtained by applying the result to $-X_i$.) The stated inequality then follows by splitting into the cases $\epsilon \le B$ and $\epsilon > B$.
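As a quick sanity check of 7) (with the constants as stated above, which are a reconstruction), take $X \sim \mathrm{Exp}(1)$: then $\mathbb{E}|X|^q = q!$, so $(\mathbb{E}|X|^q/q!)^{1/q} = 1$ for all $q$ and we may take $B = 1$.

```python
# Simulation check of the Bernstein-type bound in 7) for X ~ Exp(1), B = 1.
import numpy as np

rng = np.random.default_rng(5)
n, B, reps = 100, 1.0, 20_000
means = rng.exponential(1.0, size=(reps, n)).mean(axis=1)

for eps in [0.1, 0.2, 0.3]:
    emp = (np.abs(means - 1.0) >= eps).mean()
    bound = 2 * np.exp(-(n / 2) * min(eps**2 / B**2, eps / B))
    print(f"eps={eps}: empirical {emp:.5f} <= bound {bound:.5f}")
```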
Lemma 4. Suppose $X$ follows a truncated normal distribution on $\mathbb{R}_+^m$ with parameters $\mu$ and $\Sigma = K^{-1} \succ 0$. Let $X^{(1)}, \ldots, X^{(n)}$ be i.i.d. copies of $X$, with $X_j^{(i)}$ the $j$-th component of the $i$-th copy. Then:
1. For $j = 1, \ldots, m$, $\tau(X_j - \mathbb{E}X_j) \le \sqrt{\Sigma_{jj}}$. That is, the sub-Gaussian parameter of any marginal distribution of $X$, after centering, is bounded by the square root of its corresponding diagonal entry in the covariance parameter $\Sigma$. Consequently, for any $\epsilon > 0$, a sub-Gaussian tail bound holds for the deviation of the sample mean of $X_j$ from $\mathbb{E}X_j$; in particular, if $h_0$ is a function bounded by $M_0$, an analogous tail bound holds for sample averages of $h_0(X_j^{(i)})\,X_j^{(i)}$.
To see this, note that the relevant moment generating function can be written as an integral of $\exp(-\tfrac{1}{2} x^\top \Sigma^{-1} x)$, which is log-concave in $x$, against an indicator $\mathbf{1}_{\mathbb{R}_+^m - \mu}(x + t\,\cdot)$, which is log-concave in $(x, t)$ since $\mathbb{R}_+^m - \mu$ is a convex set. Since log-concavity is closed under multiplication and under integration over $\mathbb{R}^m$, the integral is indeed log-concave, and the proof of the bound on the sub-Gaussian parameter of $X_j - \mathbb{E}X_j$ is complete. The tail bound follows from 5) of Lemma 3. Now, by 1) and 2) of Lemma 3, $\|X_j\|_{\psi_2} \le 2\sqrt{\Sigma_{jj}/e} + \mathbb{E}X_j$. If $h_0$ is a function bounded by $M_0$, then by definition the sub-Gaussian norm of $h_0(X_j)X_j$ is bounded by $M_0\|X_j\|_{\psi_2}$; by 1) and 2) of Lemma 3 again, the centered version is controlled as well, and the tail bound thus follows from the first inequality using 5) of Lemma 3.
2. By the proof of part 1, $\|X_j\|_{\psi_2} \le 2\sqrt{\Sigma_{jj}/e} + \mathbb{E}X_j$, and by 3) of Lemma 3, the product $X_j X_k$ is sub-exponential. Since $h_0$ is a function bounded by $M_0$, by definition the sub-exponential norm of $h_0(X_j)X_j X_k$ is bounded by $M_0\|X_j X_k\|_{\psi_1}$; then, by 1) of Lemma 3 again, the centered version is controlled as well. The tail bound then follows from 6) of Lemma 3.
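The tail bound of part 1 can be probed empirically by rejection sampling from a truncated normal. The parameters below are arbitrary illustrative choices, and the comparison uses the tail bound $2\exp(-t^2/(2\Sigma_{jj}))$ implied by a sub-Gaussian parameter of $\sqrt{\Sigma_{jj}}$ (via 5) of Lemma 3).

```python
# Empirical check of the sub-Gaussian tail bound of Lemma 4, part 1, for a
# bivariate truncated normal sampled by simple rejection (illustrative only).
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([0.5, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])

Z = rng.multivariate_normal(mu, Sigma, size=200_000)
X = Z[(Z > 0).all(axis=1)]          # keep draws in the positive orthant

j = 0
dev = np.abs(X[:, j] - X[:, j].mean())
for t in [0.5, 1.0, 1.5, 2.0]:
    emp = (dev > t).mean()
    bound = 2 * np.exp(-t**2 / (2 * Sigma[j, j]))   # sub-Gaussian tail bound
    print(f"t={t}: empirical {emp:.4f} <= bound {bound:.4f}")
```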
Although not used in our analysis of the consistency results, in the special case $h_0 \equiv 1$ we also have the following lemma. The notable difference between bound (49) below and bound (45) from part 2 of Lemma 4 lies in the constants and the dependence on $\mathbb{E}X_j$: the constants in the denominator on the right-hand side of (45) are smaller and thus give a tighter bound, but (49) is preferred when $\mathbb{E}X_j$ is notably large compared to $\sqrt{\Sigma_{jj}}$, since its constant is only linear in $\mathbb{E}X_j$.
Proof of Lemma 5. We use a proof similar to that of Lemma 1 in Ravikumar et al. [2011] (note that the assumption $\mathbb{E}X_j = 0$ in Ravikumar et al. [2011] does not hold in our case). After defining the relevant deviation events, a union bound reduces the claim to an entrywise tail bound. We next define the quantities $Z_{jk}$, $A_{jk}$, and $B_{jk}$. Since $\tau$ is a norm by 2) of Lemma 3, $A_{jk}$ is sub-Gaussian with parameter at most $\sqrt{\Sigma_{jj}} + \sqrt{\Sigma_{kk}}$, and $B_{jk}$ is sub-Gaussian with parameter at most $2(\mathbb{E}X_j + \mathbb{E}X_k)(\sqrt{\Sigma_{jj}} + \sqrt{\Sigma_{kk}})$. Using 4) of Lemma 3 together with the inequality $(a + b + c)^q \le (3\max\{a,b,c\})^q \le 3^q(a^q + b^q + c^q)$ for all $a, b, c \ge 0$ and $q > 0$, we obtain a moment bound for any $q \ge 2$. Using $\mathrm{var}(X + Y) \le 2(\mathrm{var}(X) + \mathrm{var}(Y))$ and the fact that $\mathrm{var}(X_j) = \mathrm{var}(X_j - \mathbb{E}X_j) \le \tau^2(X_j - \mathbb{E}X_j) \le \Sigma_{jj}$ (by 2) of Lemma 3 and part 1 of Lemma 4), we then have
\[
\Big(\frac{\mathbb{E}|Z_{jk}|^q}{q!}\Big)^{1/q} \le 3^{1+1/q}\Big[2^{3+1/q}\,\frac{q}{e}\,\max_j \Sigma_{jj} + 2^{3+1/q}\sqrt{\frac{q}{e}}\;\max_j \mathbb{E}X_j \cdot \sqrt{\max_j \Sigma_{jj}} + 4\max_j \Sigma_{jj}\Big]\,(q!)^{-1/q}.
\]
Since all three coefficients involving $q$ are decreasing in $q \ge 2$, $\sup_{q \ge 2} (\mathbb{E}|Z_{jk}|^q/q!)^{1/q}$ is bounded by the value of the right-hand side at $q = 2$.
A tail bound for the sample average of $V_{jk}^2$ can be derived similarly, and the result follows.

C Simulation Results for Erdős–Rényi Graphs
Here we show simulation results for Erdős–Rényi (ER) graphs, defined below, under the same settings as in Section 7. Interpretations are similar to those for the SUB graphs and are thus omitted. For a choice of $\pi \in (0, 1)$, we generate a graph by independently including each possible edge with probability $\pi$. Non-zero entries in the lower-triangular part of the matrix $K_0$ are independent draws from the uniform distribution on $[0.5, 1]$. The matrix is then symmetrized, and the diagonal elements are set such that $K_0$ has minimum eigenvalue 0.1. The parameter $\pi$ is chosen to be 0.08 for $n = 1000$ and 0.02 for $n = 80$.

[Figure: simulation results for ER graphs. Each curve corresponds to a different choice of $\lambda_K/\lambda_\eta$. Squares indicate models picked by eBIC with refit; the square with black outline has the highest eBIC among all models (combinations of $\lambda_K$, $\lambda_\eta$). The multipliers correspond to medium and high for $n = 80$, and low, medium, and high for $n = 1000$, respectively.]
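For reference, the sketch below reproduces the generation mechanism just described; the function and variable names are our own, and shifting the whole diagonal is one simple way to set the minimum eigenvalue.

```python
# Generate an Erdős–Rényi precision matrix K_0 as described above: include
# each edge with probability pi, draw nonzero lower-triangular entries from
# Uniform[0.5, 1], symmetrize, and shift the diagonal so that the minimum
# eigenvalue of K_0 equals 0.1.
import numpy as np

def generate_er_precision(m, pi, min_eig=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    K = np.zeros((m, m))
    rows, cols = np.tril_indices(m, k=-1)
    edges = rng.random(rows.size) < pi                     # ER edge indicators
    K[rows, cols] = np.where(edges, rng.uniform(0.5, 1.0, rows.size), 0.0)
    K = K + K.T                                            # symmetrize
    K += (min_eig - np.linalg.eigvalsh(K)[0]) * np.eye(m)  # set min eigenvalue
    return K

K0 = generate_er_precision(m=100, pi=0.08, rng=np.random.default_rng(4))
print(np.linalg.eigvalsh(K0)[0])   # approximately 0.1
```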